PREFACE


The subject of system identification is too broad to be covered completely in one book. This document
is restricted to statistical system identification; that is, methods derived from probabilistic mathematical
statements of the problem. We will be primarily interested in maximum-likelihood and related estimators.
Statistical methods are becoming increasingly important with the proliferation of high-speed, general-purpose
digital computers. Problems that were once solved by hand-plotting the data and drawing a line through them
are now done by telling a computer to fit the best line through the data (or by some completely different,
formerly impractical method). Statistical approaches to system identification are well suited to computer
application.

Automated statistical algorithms can solve more complicated problems more rapidly, and sometimes more
accurately, than the older manual methods. There is a danger, however, of the engineer's losing the intuitive
feel for the system that arises from long hours of working closely with the data. To use statistical
estimation algorithms effectively, the engineer must have not only a good grasp of the system under analysis, but
also a thorough understanding of the analytic tools used. The analyst must strive to understand how the
system behaves and what characteristics of the data influence the statistical estimators in order to evaluate
the validity and meaning of the results.

Our primary aim in this document is to provide the practicing data analyst with the background necessary
to make effective use of statistical system identification techniques, particularly maximum-likelihood and
related estimators. The intent is to present the theory in a manner that aids intuitive understanding at a
concrete level useful in application. Theoretical rigor has not been sacrificed, but we have tried to avoid
"elegant" proofs that may require three lines to write, but 3 years of study to comprehend the underlying
theory. In particular, such theoretically intriguing subjects as martingales and measure theory are ignored.
Several excellent volumes on these subjects are available, including Balakrishnan (1973), Royden (1968), Rudin
(1974), and Kushner (1971).

We assume that the reader has a thorough background in linear algebra and calculus (Paige, Swift, and
Slobko, 1974; Apostol, 1969; Nering, 1969; and Wilkinson, 1965), including complete familiarity with matrix
operations, vector spaces, inner products, norms, gradients, eigenvalues, and related subjects. The reader 
should be familiar with the concept of function spaces as types of abstract vector spaces (Luenberger, 1969), 
but does not need expertise in functional analysis. We also assume familiarity with concepts of deterministic 
dynamic systems (Zadeh and Desoer, 1963; Wiberg, 1971; and Levan, 1983). 

Chapter 1 introduces the basic concepts of system identification. Chapter 2 is an introduction to numerical
optimization methods, which are important to system identification. Chapter 3 reviews basic concepts from
probability theory. The treatment is necessarily abbreviated, and previous familiarity with probability 
theory is assumed. 

Chapters 4-10 present the body of the theory. Chapter 4 defines the concept of an estimator and some of 
the basic properties of estimators. Chapter 5 discusses estimation as a static problem in which time is not 
involved. Chapter 6 presents some simple results on stochastic processes. Chapter 7 covers the state
estimation problem for dynamic systems with known coefficients. We first pose it as a static estimation problem,
drawing on the results from Chapter 5. We then show how a recursive formulation results in a simpler solution 
process, arriving at the same state estimate. The derivation used for the recursive state estimator (Kalman 
filter) does not require a background in stochastic processes; only basic probability and the results from 
Chapter 5 are used. 

Chapters 8-10 present the parameter estimation problem for dynamic systems. Each chapter covers one of 
the basic estimation algorithms. We have considered parameter estimation as a problem in its own right, rather 
than forcing it into the form of a nonlinear filtering problem. The general nonlinear filtering problem is 
more difficult than parameter estimation for linear systems, and it requires ad hoc approximations for
practical implementation. We feel that our approach is more natural and is easier to understand.

Chapter 11 examines the accuracy of the estimates. The emphasis in this chapter is on evaluating the 
accuracy and analyzing causes of poor accuracy. The chapter also includes brief discussions about the roles 
of model structure determination and experiment design. 




TABLE OF CONTENTS

PREFACE

NOMENCLATURE

1.0 INTRODUCTION
    1.1 SYSTEM IDENTIFICATION
    1.2 PARAMETER IDENTIFICATION
    1.3 TYPES OF SYSTEM MODELS
        1.3.1 Explicit Function
        1.3.2 State Space
        1.3.3 Others
    1.4 PARAMETER ESTIMATION
    1.5 OTHER APPROACHES

2.0 OPTIMIZATION METHODS
    2.1 ONE-DIMENSIONAL SEARCHES
    2.2 DIRECT METHODS
    2.3 GRADIENT METHODS
    2.4 SECOND ORDER METHODS
        2.4.1 Newton-Raphson
        2.4.2 Invariance
        2.4.3 Singularities
        2.4.4 Quasi-Newton Methods
    2.5 SUMS OF SQUARES
        2.5.1 Linear Case
        2.5.2 Nonlinear Case
    2.6 CONVERGENCE IMPROVEMENT

3.0 BASIC PRINCIPLES FROM PROBABILITY
    3.1 PROBABILITY SPACES
        3.1.1 Probability Triple
        3.1.2 Conditional Probabilities
    3.2 SCALAR RANDOM VARIABLES
        3.2.1 Distribution and Density Functions
        3.2.2 Expectations and Moments
    3.3 JOINT RANDOM VARIABLES
        3.3.1 Distribution and Density Functions
        3.3.2 Expectations and Moments
        3.3.3 Marginal and Conditional Distributions
        3.3.4 Statistical Independence
    3.4 TRANSFORMATION OF VARIABLES
    3.5 GAUSSIAN VARIABLES
        3.5.1 Standard Gaussian Distributions
        3.5.2 General Gaussian Distributions
        3.5.3 Properties
        3.5.4 Central Limit Theorem

4.0 STATISTICAL ESTIMATORS
    4.1 DEFINITION OF AN ESTIMATOR
    4.2 PROPERTIES OF ESTIMATORS
        4.2.1 Unbiased Estimators
        4.2.2 Minimum Variance Estimators
        4.2.3 Cramer-Rao Inequality (Efficient Estimators)
        4.2.4 Bayesian Optimal Estimators
        4.2.5 Asymptotic Properties
    4.3 COMMON ESTIMATORS
        4.3.1 A posteriori Expected Value
        4.3.2 Bayesian Minimum Risk
        4.3.3 Maximum a posteriori Probability
        4.3.4 Maximum Likelihood

5.0 THE STATIC ESTIMATION PROBLEM
    5.1 LINEAR SYSTEMS WITH ADDITIVE GAUSSIAN NOISE
        5.1.1 Joint Distribution of Z and $\xi$
        5.1.2 A posteriori Estimators
        5.1.3 Maximum Likelihood Estimator
        5.1.4 Comparison of Estimators
    5.2 PARTITIONING IN ESTIMATION PROBLEMS
        5.2.1 Measurement Partitioning
        5.2.2 Application to Linear Gaussian System
        5.2.3 Parameter Partitioning
    5.3 LIMITING CASES AND SINGULARITIES
        5.3.1 Singular P
        5.3.2 Singular GG*
        5.3.3 Singular CPC* + GG*
        5.3.4 Infinite P
        5.3.5 Infinite GG*
        5.3.6 Singular $C^*(GG^*)^{-1}C + P^{-1}$

    5.4 NONLINEAR SYSTEMS WITH ADDITIVE GAUSSIAN NOISE
        5.4.1 Joint Distribution of Z and $\xi$
        5.4.2 Estimators
        5.4.3 Computation of the Estimates
        5.4.4 Singularities
        5.4.5 Partitioning
    5.5 MULTIPLICATIVE GAUSSIAN NOISE (ESTIMATION OF VARIANCE)
    5.6 NON-GAUSSIAN NOISE

6.0 STOCHASTIC PROCESSES
    6.1 DISCRETE TIME
        6.1.1 Linear Systems Forced by Gaussian White Noise
        6.1.2 Nonlinear Systems and Non-Gaussian Noise
    6.2 CONTINUOUS TIME
        6.2.1 Linear Systems Forced by White Noise
        6.2.2 Additive White Measurement Noise
        6.2.3 Nonlinear Systems

7.0 STATE ESTIMATION FOR DYNAMIC SYSTEMS
    7.1 EXPLICIT FORMULATION
    7.2 RECURSIVE FORMULATION
        7.2.1 Prediction Step
        7.2.2 Correction Step
        7.2.3 Kalman Filter
        7.2.4 Alternate Forms
        7.2.5 Innovations
    7.3 STEADY-STATE FORM
    7.4 CONTINUOUS TIME
    7.5 CONTINUOUS/DISCRETE TIME
    7.6 SMOOTHING
    7.7 NONLINEAR SYSTEMS AND NON-GAUSSIAN NOISE

8.0 OUTPUT ERROR METHOD FOR DYNAMIC SYSTEMS
    8.1 DERIVATION
    8.2 INITIAL CONDITIONS
    8.3 COMPUTATIONS
        8.3.1 Gauss-Newton Method
        8.3.2 System Response
        8.3.3 Finite Difference Response Gradient
        8.3.4 Analytic Response Gradient
    8.4 UNKNOWN G
    8.5 CHARACTERISTICS

9.0 FILTER ERROR METHOD FOR DYNAMIC SYSTEMS
    9.1 DERIVATION
        9.1.1 Static Derivation
        9.1.2 Derivation by Recursive Factoring
        9.1.3 Derivation Using the Innovation
        9.1.4 Steady-State Form
        9.1.5 Cost Function Discussion
    9.2 COMPUTATION
    9.3 FORMULATION AS A FILTERING PROBLEM

10.0 EQUATION ERROR METHOD FOR DYNAMIC SYSTEMS
    10.1 PROCESS-NOISE APPROACH
        10.1.1 Derivation
        10.1.2 Special Case of Filter Error
        10.1.3 Discussion
    10.2 GENERAL EQUATION ERROR FORM
        10.2.1 Discrete State-Equation Error
        10.2.2 Continuous/Discrete State-Equation Error
        10.2.3 Observation-Equation Error
    10.3 COMPUTATION
    10.4 DISCUSSION




11.0 ACCURACY OF THE ESTIMATES
    11.1 CONFIDENCE REGIONS
        11.1.1 Random Parameter Vector
        11.1.2 Nonrandom Parameter Vector
        11.1.3 Gaussian Approximation
        11.1.4 Nonstatistical Derivation
    11.2 ANALYSIS OF THE CONFIDENCE ELLIPSOID
        11.2.1 Sensitivity
        11.2.2 Correlation
        11.2.3 Cramer-Rao Bound
    11.3 OTHER MEASURES OF ACCURACY
        11.3.1 Bias
        11.3.2 Scatter
        11.3.3 Engineering Judgment
    11.4 MODEL STRUCTURE DETERMINATION
    11.5 EXPERIMENT DESIGN




12.0 SUMMARY

A.0 MATRIX RESULTS
    A.1 MATRIX INVERSION LEMMAS
    A.2 MATRIX DIFFERENTIATION

REFERENCES





NOMENCLATURE


Symbols

It is impractical to list all of the symbols used in this document. The following are symbols of particular
significance and those used consistently in large portions of the document. In several specialized
situations, the same symbols are used with different meanings not included in this list.

A               stability matrix
B               control matrix
b(.)            bias
C               state observation matrix
D               control observation matrix
E{.}            expected value
e               error vector
F(.)            system function
FF*             process noise covariance matrix
$F_x(\cdot)$    probability distribution function of x
f(.)            system state function
GG*             measurement noise covariance matrix
g(.)            system observation function
h(.)            equation error function
J(.)            cost function
M               Fisher information matrix
$m_\xi$         prior mean of $\xi$
$n_i$, n(t)     process noise vector
P               prior covariance of $\xi$, or covariance of filtered x
p(x)            probability density function of x, short notation
$p_x(\cdot)$    probability density function of x, full notation
Q               covariance of predicted x
R               covariance of innovation
t               time
U               system input
$u_i$, u(t)     dynamic system input vector
$V_i$           concatenated innovation vector
v               innovation vector
x               parameter vector in static models
$x_i$, x(t)     dynamic system state vector
Z               system response
$Z_i$           concatenated response vector
$z_i$, z(t)     dynamic system response vector
$\Delta$        sample interval
$\eta$          measurement noise vector
$\Phi$          state transition matrix
$\Psi$          input transition matrix
$\xi$           vector of unknown parameters
$\Xi$           set of possible parameter values
$\omega$        random noise vector
$\Omega$        probability space
predicted estimate (in filtering contexts) 

optimum (in optimization contexts), or estimate (in estimation contexts), or filtered estimate (in 
filtering contexts) 

smoothed estimate 

Subscript $\xi$ indicates dependence on $\xi$

Abbreviations and acronyms

$\arg\max_x$    value of x that maximizes the following function
corr            correlation
cov             covariance
exp             exponential
ln              natural logarithm
MAP             maximum a posteriori probability
MLE             maximum-likelihood estimator
mse             mean-square error
var             variance

Mathematical notation

f(.)            the entire function f, as opposed to the value of the function at a particular point
*               transpose
$\nabla_x$      gradient with respect to the vector x (result is a row vector when the operand is a scalar, or a
                matrix when the operand is a column vector)
$\nabla_x^2$    second gradient with respect to x
$\sum$          series summation
$\prod$         series product
$\pi$           3.14159...
$\cup$          set union
$\cap$          set intersection
$\subset$       subset
$\in$           element of a set
{x:c}           the set of all x such that condition c holds
(.,.)           inner product
|               conditioned on (in probability contexts)
|.|             absolute value or determinant
d|.|            volume element
$t_i^+$         right-hand limit at $t_i$
n-vector        vector with n elements
CHAPTER 1


1.0 INTRODUCTION 

System identification is broadly defined as the deduction of system characteristics from measured data.
It is commonly referred to as an inverse problem because it is the opposite of the problem of computing the
response of a system with known characteristics. Gauss (1809, p. 85) refers to "the inverse problem, that is
when the true is to be derived from the apparent place." The inverse problem might be phrased as, "Given the
answer, what was the question?" Phrased in such general terms, system identification is seen as a simple
concept used in everyday life, rather than as an obscure area of mathematics.

Example 1.0-1 The system is your body, and the characteristic of interest is
its mass. You perform an experiment by placing the system on a mechanical
transducer in the bathroom which gives as output a position approximately
proportional to the system mass and the local gravitational field. Based on
previous comparisons with the doctor's scales, you know that your scale
consistently reads 2 lb high, so you subtract this figure from the reading. The
result is still somewhat higher than expected, so you step off of the scales
and then repeat the experiment. The new reading is more "reasonable" and from
it you obtain an estimate of the system mass.

This simple example actually includes several important principles of system identification; for instance,
the resulting estimates are biased (as defined in Chapter 4).

Example 1.0-2 The "guess your weight" booth at the fair. 

The weight guesser's instrumentation and estimation algorithm are more difficult to describe precisely,
but they are used to solve the same system identification problem.

Example 1.0-3 Newton's deduction of the theory of gravity. 

Newton's problem was much more difficult than the first two examples. He had to deduce not just a single
number, but also the form of the equations describing the system. Newton was a true expert in system
identification (among other things).

As is apparent from the above examples, system identification is as much an art as a science. This point is
often forgotten by scientists who prove elegant mathematical theorems about a model that doesn't adequately
represent the true system to begin with. On the other hand, engineers who reject what they consider to be
"ivory tower theory" are forgoing tools that could give definite answers to some questions, and hints to aid
in the understanding of others.

System identification is closely tied to control theory, partially by some common methodology, and
partially by the use of identified system models for control design. Before you can design a controller for a
system, you must have some notion of the equations describing the system.

Another common purpose of system identification is to help gain an understanding of how a system works.
Newton's investigations were more along this line. (It is unlikely that he wanted to control the motion of
the planets.)

The application of system identification techniques is strongly dependent on the purpose for which the
results are intended; radically different system models and identification techniques may be appropriate for
different purposes related to the same system. The aircraft control system designer will be unimpressed when
given a model based on inputs that cannot be influenced, outputs that cannot be measured, aspects of the
system that the designer does not want to control, and a complicated model in a form not amenable to control
analysis techniques. The same model might be ideal for the aerodynamicist studying the flow around the
vehicle. The first and most important step of any system identification application is to define its purpose.

Following this chapter's overview, this document presents one aspect of the science of system
identification: the theory of statistical estimation. The theory's main purpose is to help the engineer understand the
system, not to serve as a formula for consistently producing the required results. Therefore, our exposition
of the theory, although rigorously defensible, emphasizes intuitive understanding rather than mathematical
sophistication. The following comments of Luenberger (1969, p. 2) also apply to the theory of system
identification:


Some readers may look with great expectation toward functional analysis, hoping 
to discover new powerful techniques that will enable them to solve important 
problems beyond the reach of simpler mathematical analysis. Such hopes are 
rarely realized in practice. The primary utility of functional analysis ... is
its role as a unifying discipline, gathering a number of apparently diverse, 
specialized mathematical tricks into one or a few geometric principles. 

With good intuitive understanding, which arises from such unification, the reader will be better equipped to 
extend the ideas to other areas where the solutions, although simple, were not formerly obvious. 

The literature of the field often uses the terms "system identification," "parameter identification," and
"parameter estimation" interchangeably. The following sections define and differentiate these broad terms.
The majority of the literature in the field, including most of this document, addresses the field most
precisely called parameter estimation.


1.1 SYSTEM IDENTIFICATION 

We begin by phrasing the system identification problem in formal mathematical terms. There are three
elements essential to a system identification problem: a system, an experiment, and a response. We define
these elements here in broad, abstract, set-theoretic terms, before introducing more concrete forms in
Section 1.3.

Let U represent some experiment, taken from the set $\mathcal{U}$ of possible experiments on the system.
U could represent a discrete event, such as stepping on the scales; or a value, such as a voltage applied.
U could also be a vector function of time, such as the motions of the control surfaces while an airplane is
flown through a maneuver. In systems terminology, U is the input to the system. (We will use the terms
"input," "control," and "experiment" more or less interchangeably.)

Observe the response Z of the system to the experiment. As with U, Z could be represented in many
forms, including as a discrete event (e.g., "the system blew up") or as a measured time function. It is an
element of the set $\mathcal{Z}$ of possible responses. (We also use the terms "output" or "measurement" for Z.)

The abstract system is a map (function) F from the set of possible experiments to the set of possible
responses,

$$F: \mathcal{U} \rightarrow \mathcal{Z} \tag{1.1-1}$$

that is,

$$Z = F(U) \tag{1.1-2}$$

The system identification problem is to reconstruct the function F from a collection of experiments
$U_i$ and the corresponding system responses $Z_i$. This is the purest form of the "black box" identification
problem. We are asked to identify the system with no information at all about its internal structure, as if
the system were in a black box which we could not see into. Our only information is the inputs and outputs.

An obvious solution is to perform all of the experiments in $\mathcal{U}$ and simply tabulate the responses. This
is usually impossible because the set $\mathcal{U}$ is too large (typically, infinite). Also, we may not have complete
freedom in selecting the $U_i$. Furthermore, even if this approach were possible, the tabular format of the
result would generally be inconvenient and of little help in understanding the structure of the system.

If we cannot perform all of the experiments in $\mathcal{U}$, the system identification problem is impossible
without further information. Since we have made no assumptions about the form of F, we cannot be sure of its
behavior without checking every point.


Example 1.1-1 The input U and output Z of a system are both represented
by real-valued scalar variables. When an input of 1.0 is applied, the output
is 1.0. When an input of -1.0 is applied, the output is also 1.0. Without
further information we cannot tell which of the following representations (or
an infinite number of others) of the system is correct.

a) $Z = 1$ (independent of U)

b) $Z = |U|$

c) $Z = U^2$

d) The response depends on the time interval between applying U and
measuring Z, which we forgot to consider.
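
The ambiguity in Example 1.1-1 can be checked mechanically. The following is a minimal sketch, not part of the original text: it encodes the first three candidate representations as Python functions and confirms that each one reproduces both measured responses. The extra input U = 2 is a hypothetical further experiment, chosen only to show that the candidates disagree away from the data.

```python
# Candidate system functions from Example 1.1-1; each maps a scalar input U
# to a scalar response Z.
candidates = {
    "Z = 1": lambda u: 1.0,
    "Z = |U|": lambda u: abs(u),
    "Z = U^2": lambda u: u ** 2,
}

# The two experiments actually performed: (input, measured response).
experiments = [(1.0, 1.0), (-1.0, 1.0)]


def consistent(model, data):
    """Return True if the model reproduces every measured response."""
    return all(model(u) == z for u, z in data)


# Every candidate reproduces the data, so the data cannot single one out.
matches = [name for name, f in candidates.items() if consistent(f, experiments)]
print(matches)

# A further experiment (hypothetical input U = 2) would separate the candidates.
predictions = {name: f(2.0) for name, f in candidates.items()}
print(predictions)
```

All three functions pass the consistency check, yet they predict three different responses at the untried input, which is the sense in which the black-box problem is underdetermined.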

Example 1.1-2 The input and output of a system are scalar time functions
on the interval $(-\infty, \infty)$. When the input is cos(t), the output is sin(t).
Without more information we cannot distinguish among

a) $z(t) = \sin(t)$, independent of U

b) $z(t) = \int_{-\infty}^{t} u(s)\, ds$

c) $z(t) = -\dot{u}(t)$

d) $z(t) = u(t - \pi/2)$

Example 1.1-3 The input and output of a system are integers in the range
1 to 100. For every input except U = 37, we measure the output and find it
equal to the input. We have no mathematical basis for drawing any conclusion
about the response to the input U = 37. We could guess that the output might
be Z = 37, but there is no mathematical justification for this guess in the
problem as formulated.

Our inability to draw any conclusions in the above examples (particularly Example (1.1-3), which seems
so obvious intuitively) points out the inadequacy of the pure black-box statement of the system identification
problem. We cannot reconstruct the function F without some guidance on choosing a particular function from
the infinite number of functions consistent with the results of the experiments performed.

We have seen that the pure black box system identification problem, where absolutely no information is
given about the internal structure of the system, is impossible to solve. The information needed to construct
the system function F is thus composed of two parts: information which is assumed, and information which is
deduced from the experimental data. These two information sources can closely interact. For instance, the
experimental data could contradict the assumptions made, requiring a revision of the assumptions, or the data
could be used to select one of a set of candidate assumptions (hypotheses). Such interaction tends to obscure
the role of the assumption, making it seem as though all of the information was obtained from the experimental
data, and thus has a purely objective validity. In fact, this is never the case. Realistically, most of the
information used for constructing the system function F will be assumptions based on knowledge of the nature
of the physical processes of the system. System identification technology based on experimental data is used
only to fill in the relatively small gaps in our knowledge of the system. From this perspective, we recognize
system identification as an extremely useful tool for filling in such knowledge gaps, rather than as a panacea
which will automatically tell us everything we need to know about a system. The capabilities of some modern
techniques may invite the view of system identification as a cure-all, because the underlying assumptions are
subtle and seldom explicitly stated.

Example 1.1-4 Return to the problem of Example (1.1-3). Seemingly, not much
knowledge of the internal behavior of the system is required to deduce that
Z will be 37 when U is 37; indeed, many common system identification
algorithms would make such a deduction. In fact, the assumptions made are
numerous. The specification of the set of possible inputs and outputs already
implies many assumptions about the system; for instance, that there are no
transient effects, or that such effects are unimportant. The problem
statement does not allow for an event such as the system output's oscillating
through several values. We have also made an assumption of repeatability.
Perhaps the same experiment redone tomorrow would produce different results,
depending on some factor not considered. Encompassing all of the other
assumptions is the assumption of simplicity. We have applied Occam's Razor
and found the simplest system consistent with the data. One can easily
imagine useful systems that select specific inputs for special treatment.
Nothing in the data has eliminated such systems. We can see that the
assumptions play the largest role in solving this problem. Granted the assumption
that we want the simplest consistent result, the deduction from the data that
Z = U is trivial.

Two general types of assumptions exist. The first consists of restrictions on the allowable forms of
the function F. Presumably, such restrictions would reflect the knowledge of what functions are reasonable
considering the physics of the system. The second type of assumption is some criterion for selecting a "best"
function from those consistent with the experimental results. In the following sections, we will see that
these two approaches are combined: restricting the set of functions considered, and then selecting a best
choice from this set.
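
The two types of assumptions just described can be sketched in a few lines of Python, using the data of Example (1.1-3). This is an illustration only: the list of allowed functions and the hand-assigned complexity scores are hypothetical, introduced here to stand in for "restrictions on the allowable forms" and "a criterion for selecting a best function."

```python
# Experimental data of Example 1.1-3: the output equals the input for every
# integer input from 1 to 100 except U = 37, which was never tried.
data = [(u, u) for u in range(1, 101) if u != 37]

# Assumption of the first type: only these forms of F are allowed.
# The complexity scores are hypothetical, chosen by hand for illustration.
allowed = [
    (1, "Z = U", lambda u: u),
    (2, "Z = U except Z = 0 at U = 37", lambda u: 0 if u == 37 else u),
    (1, "Z = 50", lambda u: 50),
]

# Keep the allowed functions that agree with every experiment performed.
consistent = [m for m in allowed if all(m[2](u) == z for u, z in data)]

# Assumption of the second type: prefer the simplest consistent function.
complexity, name, best = min(consistent, key=lambda m: m[0])

print([m[1] for m in consistent])
print(name, best(37))
```

Two of the allowed functions survive the comparison with the data; only the simplicity criterion, which is an assumption and not a measurement, selects Z = U and hence the prediction for U = 37.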


1.2 PARAMETER IDENTIFICATION 

For physical systems, information about the general form of the system function F can often be derived
from knowledge of the system. Specific numerical values, however, are sometimes prohibitively difficult to
compute theoretically without making unacceptable approximations. Therefore, the most widely used area of
system identification is the subfield called parameter identification.


In parameter identification, the form of the system function is assumed to be known. This function
contains a finite number of parameters, the values of which must be deduced from experimental data.

Let $\xi$ be a vector with the unknown parameters as its elements. Then the system response Z is a known
function of the input U and the parameter vector $\xi$. We can restate this in a more convenient, but
completely equivalent way. For each value of the parameter vector $\xi$, the system response Z is a known function
of the input U. (The function can be different for different values of $\xi$.) We say that the function is
parameterized by $\xi$ and write

$$Z = F_\xi(U) \tag{1.2-1}$$

The function $F_\xi(U)$ is referred to as the assumed system model. The subscript notation for $\xi$ is used purely
for convenience to indicate the special role of $\xi$. The function could be equivalently written as $F(\xi,U)$.

The parameter identification problem is then to deduce the value of $\xi$ based on measurement of the responses
$Z_i$ to a set of inputs $U_i$. This problem of identifying the parameter vector $\xi$ is much less ambitious than
the system identification problem of constructing the entire F function from experimental data; it is more
in line with the amount of information that reasonably can be expected to be obtained from experimental data.

Deducing the value of $\xi$ amounts to solving the following set of simultaneous and generally nonlinear
equations.

$$Z_i = F_\xi(U_i), \qquad i = 1, 2, \ldots, N \tag{1.2-2}$$

where N is the number of experiments performed. Note that the only variable in these equations is the
parameter vector $\xi$. The $U_i$ and $Z_i$ represent the specific input used and response measured for the ith
experiment. This is quite different from Equation (1.2-1), which expresses a general relationship among the three
variables U, Z, and $\xi$.

Example 1.2-1 In the problem of Example (1.1-1), assume we are given that the
response is a linear function of the input

$$Z = F_\xi(U) = a_0 + a_1 U$$

The parameter vector is $\xi = (a_0, a_1)^*$, the values of $a_0$ and $a_1$ being unknown.
We were given that U = -1 and U = +1 both result in Z = 1; thus
Equation (1.2-2) expands to

$$1 = F_\xi(-1) = a_0 - a_1$$

$$1 = F_\xi(1) = a_0 + a_1$$

This system is easy to solve and gives $a_0 = 1$ and $a_1 = 0$. Thus we have
F(U) = 1 (independent of U).
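
The solution of Example 1.2-1 can be reproduced numerically. The sketch below is an illustration rather than the document's own procedure: it writes each experiment as one row (1, U) of a 2-by-2 linear system in the unknowns a0 and a1, and solves it with a small Cramer's-rule helper. The helper name `solve_2x2` is hypothetical.

```python
def solve_2x2(m, rhs):
    """Solve a 2-by-2 linear system m x = rhs by Cramer's rule."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular system: no unique solution")
    return ((rhs[0] * d - b * rhs[1]) / det, (a * rhs[1] - rhs[0] * c) / det)


# Each experiment (U, Z) contributes one equation Z = a0 + a1 * U,
# that is, one row (1, U) of the coefficient matrix.
experiments = [(-1.0, 1.0), (1.0, 1.0)]
rows = [(1.0, u) for u, _ in experiments]
responses = [z for _, z in experiments]

a0, a1 = solve_2x2(rows, responses)
print(a0, a1)
```

The determinant of the coefficient matrix is nonzero because the two inputs differ, which is why two experiments suffice to fix the two parameters of the linear model.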

Example 1.2-2 In the problem of Example (1.1-2), assume we know that the
system can be represented as

$$\dot{z}(t) = a z(t) + b u(t)$$

or, equivalently, expressing Z as an explicit function of U,

$$Z = F_\xi(U): \quad z(t) = \int_{-\infty}^{t} e^{a(t-\tau)}\, b\, u(\tau)\, d\tau$$

The unknown parameter vector for this system is $\xi = (a, b)^*$. Since
$u(t) = \cos(t)$ resulted in $z(t) = \sin(t)$, Equation (1.2-2) becomes

$$\sin(t) = \int_{-\infty}^{t} e^{a(t-\tau)}\, b \cos(\tau)\, d\tau$$

for all $t \in (-\infty, \infty)$. This equation is uniquely solved by $a = 0$ and $b = 1$.

Example 1.2-3 In the problem of Example (1.1-3), assume that the system can
be represented by a polynomial of order 10 or less.

$$Z = F_\xi(U) = \sum_{n=0}^{10} a_n U^n$$

The unknown parameter vector is $\xi = (a_0, a_1, \ldots, a_{10})^*$. Using the experimental
data described in Example (1.1-3), Equation (1.2-2) becomes

$$i = \sum_{n=0}^{10} a_n i^n, \qquad i = 1, 2, \ldots, 36, 38, 39, \ldots, 100$$

This system of equations is uniquely solved by $a_0 = 0$, $a_1 = 1$, and $a_2$
through $a_{10}$ all equalling 0.

As with any set of equations, there are three possible results from Equation (1.2-2). First, there can
be a unique solution, as in each of the examples above. Second, there could be multiple solutions, in which
case either more experiments must be performed or more assumptions would be necessary to restrict the set of
allowable solutions or to pick a best solution in some sense. The third possibility is that there could be no
solutions, the experimental data being inconsistent with the assumed equations. This situation will require a
basic change in our way of thinking about the problem. There will almost never be an exact solution with real
data, so the first two possibilities are somewhat academic. The remainder of the document, and Section 1.4 in
particular, will address the general situation where Equation (1.2-2) need not have an exact solution. The
possibilities of one or more solutions are part of the general case.

Example 1.2-4 In the problem of Example (1.1-1), assume we are given that
the response is a quadratic function of the input

$$Z = F_\xi(U) = a_0 + a_1 U + a_2 U^2$$

The parameter vector is $\xi = (a_0, a_1, a_2)^*$. We were given that U = -1 and
U = +1 both result in Z = 1. With these data Equation (1.2-2) expands to

$$1 = F_\xi(-1) = a_0 - a_1 + a_2$$

$$1 = F_\xi(1) = a_0 + a_1 + a_2$$

From this information we can deduce that $a_1 = 0$, but $a_0$ and $a_2$ are not
uniquely determined. The values might be determined by performing the
experiment U = 0. Alternately, we might decide that the lowest order
system consistent with the data available is preferred, giving $a_2 = 0$
and $a_0 = 1$.
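
As a short worked step (added here for illustration, using only the two equations of Example 1.2-4), subtracting and adding the equations for U = +1 and U = -1 gives

$$
\begin{aligned}
(a_0 + a_1 + a_2) - (a_0 - a_1 + a_2) &= 2a_1 = 0, \\
(a_0 + a_1 + a_2) + (a_0 - a_1 + a_2) &= 2(a_0 + a_2) = 2,
\end{aligned}
$$

so $a_1 = 0$ and $a_0 + a_2 = 1$. The data fix only the sum of $a_0$ and $a_2$. The further experiment U = 0 would measure $Z = a_0$ directly and so resolve both values.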

Example 1.2-5 In the problem of Example (1.1-1), assume that we are given
that the response is a linear function of the input. We were given that
U = -1 and U = +1 both result in Z = 1. Suppose that the experiment
U = 0 is performed and results in Z = 0.95. There are then no parameter
values consistent with the data.
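
The inconsistency in Example 1.2-5 can be seen by a direct check. In the sketch below (an illustration with a hypothetical helper name, `line_through`), the first two experiments determine the only line that could satisfy them exactly, and the third experiment is then compared against that line. A nonzero residual means Equation (1.2-2) has no exact solution for the linear model.

```python
def line_through(p, q):
    """Return (a0, a1) of the line Z = a0 + a1 * U through two (U, Z) points."""
    (u1, z1), (u2, z2) = p, q
    a1 = (z2 - z1) / (u2 - u1)
    return z1 - a1 * u1, a1


# The three experiments of Example 1.2-5 as (input, measured response).
experiments = [(-1.0, 1.0), (1.0, 1.0), (0.0, 0.95)]

# Any exact solution must pass through the first two points.
a0, a1 = line_through(experiments[0], experiments[1])

# Residual of each measurement against that line.
residuals = [z - (a0 + a1 * u) for u, z in experiments]
exact_solution_exists = all(abs(r) < 1e-12 for r in residuals)

print(residuals)
print(exact_solution_exists)
```

The third residual is about -0.05, so no pair (a0, a1) reproduces all three measurements; this is the situation that the text defers to Section 1.4.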


1.3 TYPES OF SYSTEM MODELS 

Although the basic concept of system modeling is quite general, more useful results can be obtained by
examining specific types of system models. Clarity of exposition is also improved by using specific models, 
even when we can obtain the result in a more general context. This section describes some of the broad 
classes of system model forms which are often used in parameter identification. 


1.3.1 Explicit Function 


The most basic type of system model is the explicit function. The response Z is written as a known 
explicit function of the input U and the parameter vector $\xi$. This type of model corresponds exactly to 
Equation (1.2-1): 

    $Z = F_\xi(U)$    (1.2-1) 


In the simplest subset of the explicit function models, the response is a linear function of the 
parameter vector 

    $Z = f(U)\xi$    (1.3-1) 

In this equation, f(U) is a matrix which is a known function (nonlinear in general) of the input. This is the 
type of model used in linear regression. Many systems can be put into this easily analyzed form, even though 
the systems might appear quite complex at first glance. 

A common example of a model linear in its parameters is a finite polynomial expansion of Z in terms 
of U. 

    $Z = \xi_0 + \xi_1 U + \xi_2 U^2 + \cdots + \xi_n U^n$    (1.3-2) 

In this case, f(U) is the row vector $(1, U, U^2, \ldots, U^n)$. Note that Z is linear in the parameters $\xi_i$, but 
not in the input U. 
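
The distinction between linearity in the parameters and linearity in the input can be illustrated with a short Python sketch. The function names `f` and `response` and the numerical parameter values are hypothetical choices made only for this illustration.

```python
def f(U, n):
    """Row vector (1, U, U^2, ..., U^n) for the polynomial model."""
    return [U ** k for k in range(n + 1)]


def response(U, xi):
    """Z = f(U) xi: inner product of the row vector with the parameters."""
    row = f(U, len(xi) - 1)
    return sum(r * p for r, p in zip(row, xi))


xi_a = [1.0, 0.0, 2.0]    # hypothetical parameter vectors
xi_b = [0.5, 1.0, -1.0]
U = 3.0

# Linear in the parameters: Z(xi_a + xi_b) equals Z(xi_a) + Z(xi_b).
xi_sum = [a + b for a, b in zip(xi_a, xi_b)]
lhs = response(U, xi_sum)
rhs = response(U, xi_a) + response(U, xi_b)

# Not linear in the input: doubling U does not double Z.
doubled_input = response(2.0 * U, xi_a)
doubled_output = 2.0 * response(U, xi_a)
```

Superposition holds with respect to the parameter vector but fails with respect to U, which is the property that makes linear regression applicable to such models.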


1.3.2 State Space 

State-space models are very useful for dynamic systems; that is, systems with responses that are time 
functions. Wiberg (1971) and Zadeh and Desoer (1963) give general discussions of state-space models. Time 
can be treated as either a continuous or discretized variable in dynamic models; the theories of discrete- and 
continuous-time systems are quite different. 


The general form for a continuous-time state-space model is 

    $x(t_0) = x_0$    (1.3-3a) 

    $\dot{x}(t) = f[x(t), u(t), t, \xi]$    (1.3-3b) 

    $z(t) = g[x(t), u(t), t, \xi]$    (1.3-3c) 

where f and g are arbitrary known functions. The initial condition $x_0$ can be known or can be a function 
of $\xi$. The variable x(t) is defined as the state of the system at time t. Equation (1.3-3b) is called the 
state equation, and (1.3-3c) is called the observation equation. The measured system response is z. The 
state is not considered to be measured; it is an internal system variable. However, $g[x(t), u(t), t, \xi] = x(t)$ 
is a legitimate observation function, so the measurement can be equal to the state if so desired. 


Discrete-time state-space models are similar to continuous-time models, except that the differential 
equations are replaced by difference equations. The general form is 

    $x(t_0) = x_0$    (1.3-4a) 

    $x(t_{i+1}) = f[x(t_i), u(t_i), t_i, \xi]$    $i = 0, 1, \ldots$    (1.3-4b) 

    $z(t_i) = g[x(t_i), u(t_i), t_i, \xi]$    $i = 1, 2, \ldots$    (1.3-4c) 

The system variables are defined only at the discrete times $t_i$. 
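
A discrete-time state-space model of this general form can be simulated by direct recursion. The Python sketch below is only an illustration under stated assumptions: the routine name `simulate`, the scalar example functions, and the parameter values are all hypothetical.

```python
def simulate(f, g, x0, u, t, xi):
    """Propagate x(t_{i+1}) = f[x(t_i), u(t_i), t_i, xi] from x(t_0) = x0.

    Returns the observations z(t_i) = g[x(t_i), u(t_i), t_i, xi]
    for i = 1, 2, ..., len(t) - 1.
    """
    x = x0
    z = []
    for i in range(len(t) - 1):
        x = f(x, u[i], t[i], xi)                 # state at t_{i+1}
        z.append(g(x, u[i + 1], t[i + 1], xi))   # observation at t_{i+1}
    return z


def f_example(x, u, t, xi):
    """Hypothetical scalar state equation x_{i+1} = xi[0]*x_i + xi[1]*u_i."""
    return xi[0] * x + xi[1] * u


def g_example(x, u, t, xi):
    """Hypothetical observation equation: the measurement equals the state."""
    return x


t = [0, 1, 2, 3]
u = [1.0, 1.0, 1.0, 1.0]
z = simulate(f_example, g_example, x0=0.0, u=u, t=t, xi=(0.5, 1.0))
print(z)
```

Passing f and g as arguments keeps the recursion independent of the particular model, which mirrors the generality of the state-space form.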


This document is largely concerned with continuous-time dynamic systems described by differential equations 
of the form of Equation (1.3-3b). The system response, however, is measured at discrete time points, and the 
computations are done in a digital computer. Thus, some features of both discrete- and continuous-time systems 
are pertinent. The system equations are 

    $x(t_0) = x_0$    (1.3-5a) 

    $\dot{x}(t) = f[x(t), u(t), t, \xi]$    (1.3-5b) 

    $z(t_i) = g[x(t_i), u(t_i), t_i, \xi]$    $i = 1, 2, \ldots$    (1.3-5c) 

The response $z(t_i)$ is considered to be defined only at the discrete time points $t_i$, although the state x(t) 
is defined in continuous time. 


We will see that the theory of parameter identification for continuous-time systems with discrete obser- 
vations is virtually identical to the theory for discrete-time systems in spite of the superficial differences 
in the system equation forms. The theory of continuous-time observations requires much deeper mathematical 
background and will only be outlined in this document. Since practical application of the algorithms devel- 
oped generally requires a digital computer, the continuous-time theory is of secondary importance. 

An important subset of systems described by state space equations is the set of linear dynamic systems. 
Although the equations are sometimes rewritten in forms convenient for different applications, all linear 
dynamic system models can be written in the following forms: the continuous-time form is 

    $x(t_0) = x_0$    (1.3-6a) 

    $\dot{x}(t) = A x(t) + B u(t)$    (1.3-6b) 

    $z(t) = C x(t) + D u(t)$    (1.3-6c) 

The matrix A is called the stability matrix, B is called the control matrix, and C and D are called state 
and control observation matrices, respectively. The discrete-time form is 

    $x(t_0) = x_0$    (1.3-7a) 

    $x(t_{i+1}) = \Phi x(t_i) + \Psi u(t_i)$    $i = 0, 1, \ldots$    (1.3-7b) 

    $z(t_i) = C x(t_i) + D u(t_i)$    $i = 1, 2, \ldots$    (1.3-7c) 

The matrices $\Phi$ and $\Psi$ are called the system transition matrices. The form for continuous systems with 
discrete observations is identical to Equation (1.3-6), except that the observation is defined only at the 
discrete time points. In all three forms, A, B, C, D, $\Phi$, and $\Psi$ are matrix functions of the parameter 
vector $\xi$. These matrices are functions of time in general, but for notational simplicity, we will not 
explicitly indicate the time dependence unless it is important to a discussion. 

The continuous-time and discrete-time state-equation forms are closely related. In many applications, 
the discrete-time form of Equation (1.3-7) is used as a discretized approximation to Equation (1.3-6). In 
this case, the transition matrices $\Phi$ and $\Psi$ are related to the A and B matrices by the equations 

    $\Phi = \exp(A\Delta)$    (1.3-8a) 

    $\Psi = \int_0^{\Delta} \exp(A\tau)\,d\tau\; B$    (1.3-8b) 

where 

    $\Delta = t_{i+1} - t_i$    (1.3-8c) 

We discuss this relationship in more detail in Section 7.5. In a similar manner, Equation (1.3-4) is sometimes 
viewed as an approximation to Equation (1.3-3). Although the principle in the nonlinear case is the same as 
in the linear case, we cannot write precise expressions for the relationship in such simple closed forms as in 
the linear case. 
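
For a scalar system the relationship between the continuous-time and discrete-time linear forms can be evaluated in closed form and checked directly. The Python sketch below is an illustration only. It assumes a scalar system with hypothetical values a and b, a nonzero a, and an input held constant over each sample interval, in which case the discrete recursion reproduces the continuous solution at the sample times.

```python
import math


def discretize_scalar(a, b, delta):
    """Scalar transition terms for x' = a*x + b*u sampled every delta.

    phi = exp(a*delta)
    psi = integral from 0 to delta of exp(a*tau) dtau, times b
        = (exp(a*delta) - 1)/a * b      (requires a != 0)
    """
    phi = math.exp(a * delta)
    psi = (phi - 1.0) / a * b
    return phi, psi


a, b = -2.0, 3.0      # hypothetical continuous-time system x' = a*x + b*u
delta = 0.1           # sample interval
x0, u = 1.0, 0.5      # initial state and constant input

phi, psi = discretize_scalar(a, b, delta)

# Discrete-time recursion over ten sample intervals.
x = x0
for _ in range(10):
    x = phi * x + psi * u

# Closed-form continuous-time solution at the same final time.
t_end = 10 * delta
x_exact = math.exp(a * t_end) * x0 + (math.exp(a * t_end) - 1.0) / a * b * u

print(x, x_exact)
```

When the input varies within a sample interval, the discrete form is only an approximation, and the agreement shown here no longer holds exactly.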

Standardized canonical forms of the state-space equations (Wiberg, 1971) play an important role in some 
approaches to parameter estimation. We will not emphasize canonical forms in this document. The basic theory 
of parameter identification is the same, whether canonical forms are used or not. In some applications, 
canonical forms are useful, or even necessary. Such forms, however, destroy any internal relationship between 
the model structure and the system, retaining only the external response characteristics. Fidelity to the 
internal as well as to the external system characteristics is a significant aid to engineering judgment and to 
the incorporation of known facts about the system, both of which play crucial roles in system identification. 
For instance, we might know the values of many locations of the A matrix in its "natural" form. When the 
A matrix is transformed to a canonical form, these simple facts generally become unwieldy equations which 
cannot reasonably be used. When there is little useful knowledge of the internal system structure, the use of 
canonical forms becomes more appropriate. 

1.3.3 Others 


Other types of system models are used in various applications. This document will not cover them explicitly, 
but many of the ideas and results from explicit function and state space models can be applied to other 
model types. 


One of these alternate model classes deserves special mention because of its wide use. This is the class 
of auto-regressive moving average (ARMA) models and related variants (Hajdasinski, Eykhoff, Damen, and van den 
Boom, 1982). Discrete-time ARMA models are in the general form 

    $z(t_i) + a_1 z(t_{i-1}) + \cdots + a_n z(t_{i-n}) = b_0 u(t_i) + b_1 u(t_{i-1}) + \cdots + b_m u(t_{i-m})$    (1.3-9) 

Discrete-time ARMA models can be readily rewritten as linear state space models (Schweppe, 1973), so all of 
the theory which we will develop for state space models is directly applicable. 
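
The equivalence between ARMA and linear state space models can be illustrated for the first-order case z(t_i) + a1 z(t_{i-1}) = b0 u(t_i) + b1 u(t_{i-1}). The Python sketch below uses one possible state space realization, chosen here for illustration; it is not necessarily the construction given by Schweppe (1973). The function names are hypothetical, and the response and input are assumed to be zero before the first sample.

```python
def arma1(a1, b0, b1, u):
    """First-order ARMA recursion z_i = -a1*z_{i-1} + b0*u_i + b1*u_{i-1}."""
    z, z_prev, u_prev = [], 0.0, 0.0
    for u_i in u:
        z_i = -a1 * z_prev + b0 * u_i + b1 * u_prev
        z.append(z_i)
        z_prev, u_prev = z_i, u_i
    return z


def state_space1(a1, b0, b1, u):
    """Assumed realization: x_{i+1} = phi*x_i + psi*u_i, z_i = c*x_i + d*u_i."""
    phi, psi, c, d = -a1, b1 - a1 * b0, 1.0, b0
    x, z = 0.0, []
    for u_i in u:
        z.append(c * x + d * u_i)   # observation at the current sample
        x = phi * x + psi * u_i     # state at the next sample
    return z


u = [1.0, 0.0, 0.0, 2.0, -1.0]      # hypothetical input sequence
z_arma = arma1(0.5, 1.0, 0.25, u)
z_ss = state_space1(0.5, 1.0, 0.25, u)
print(z_arma)
print(z_ss)
```

Substituting the observation equation into the state equation recovers the ARMA recursion, so the two response sequences agree sample by sample.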


1.4 PARAMETER ESTIMATION 

The examples in Section 1.2 were carefully chosen to have exact solutions. Real data is seldom so 
obliging. No matter how careful we have been in selecting the form of the assumed system model, it will not 
be an exact representation of the system. The experimental data will not be consistent with the assumed model 
form for any value of the parameter vector $\xi$. The model may be close, but it will not be exact, if for no 
other reason than that the measurements of the response will be made with real, and thus imperfect, 
instruments. 


The theoretical development seems to have arrived at a cul-de-sac. The black box system identification 
problem was not feasible because there were too many solutions consistent with the data. To remove this difficulty, 
it was necessary to assume a model form and define the problem as parameter identification. With the 
assumed model, however, there are no solutions consistent with the data. 

We need to retain the concept of an assumed model structure in order to reduce the scope of the problem, 
yet avoid the inflexibility of requiring that the model exactly reproduce the experimental data. We do this 
by using the assumed model structure, but acknowledging that it is imperfect. The assumed model structure 
should include the essential characteristics of the true system. The selection of these essential characteristics 
is the most significant engineering judgment in system analysis. A good example is Gauss' (1809, 
p. xi) justification that the major axis of a cometary ellipse is not an essential parameter, and that a 
simplified parabolic model is therefore appropriate: 

There existed, in point of fact, no sufficient reason why it should be taken 
for granted that the paths of comets are exactly parabolic: on the contrary, 
it must be regarded as in the highest degree improbable that nature should 
ever have favored such an hypothesis. Since, nevertheless, it was known, that 
the phenomena of a heavenly body moving in an ellipse or hyperbola, the major 
axis of which is very great relatively to the parameter, differs very little 
near the perihelion from the motion in a parabola of which the vertex is at 
the same distance from the focus; and that this difference becomes the more 
inconsiderable the greater the ratio of the axis to the parameter: and since, 
moreover, experience has shown that between the observed motion and the motion 
computed in the parabolic orbit, there remained differences scarcely ever 
greater than those which might safely be attributed to errors of observation 
(errors quite considerable in most cases): astronomers have thought proper to 
retain the parabola, and very properly, because there are no means whatever of 
ascertaining satisfactorily what, if any, are the differences from a parabola. 

Chapter 11 discusses some aspects of this selection, including theoretical aids to making such judgments. 

Given the assumed model structure, the primary question is how to treat imperfections in the model. 
We need to determine how to select the value of $\xi$ which makes the mathematical model the "best" 
representation of the essential characteristics of the system. We also need to evaluate the error in the 
determination of $\xi$ due to the unmodeled effects present in the experimental data. These needs introduce 
several new concepts. One concept is that of a "best" representation as opposed to the correct representation. 
It is often impossible to define a single correct representation, even in principle, because we have acknowledged 
the assumed model structure to be imperfect and we have constrained ourselves to work within this 
structure. Thus $\xi$ does not have a correct value. As Acton (1970) says on this subject, 

A favorite form of lunacy among aeronautical engineers produces countless 
attempts to decide what differential equation governs the motion of some 
physical object, such as a helicopter rotor. ...But arguments about which 
differential equation represents truth, together with their fitting calcu- 
lations, are wasted time. 

Example 1.4-1 Estimating the radius of the Earth. The Earth is not a perfect 
sphere and, thus, does not have a radius. Therefore, the problem of 
estimating the radius of the Earth has no correct answer. Nonetheless, a 
representation of the Earth as a sphere is a useful simplification for 
many purposes. 

Even the concept of the "best" representation overstates the meaning of our estimates because there is no 
universal criterion for defining a single best representation (thus our quotes around "best"). Many system 
identification methods establish an optimality criterion and use numerical optimization methods to compute the 
optimal estimates as defined by the criterion; indeed most of this document is devoted to such optimal estimators 
or approximations to them. To be avoided, however, is the common attitude that optimal (by some criterion) 
is synonymous with correct, and that any nonoptimal estimator is therefore wrong. Klein (1975) uses 
the term "adequate model" to suggest that the appropriate judgment on an identified model is whether the model 
is adequate for its intended purpose. 

In addition to these concepts of the correct, best, or adequate values of $\xi$, we have the somewhat related 
issue of errors in the determination of $\xi$ caused by the presence of unmodeled effects in the experimental 
data. Even if a correct value of $\xi$ is defined in principle, it may not be possible to determine this value 
exactly from the experimental data due to contamination of the data by unmodeled effects. 

We can now define the task as to determine the best estimate of $\xi$ obtainable from the data, or perhaps 
an adequate estimate of $\xi$, rather than to determine the correct value of $\xi$. This revised problem is more 
properly called parameter estimation than parameter identification. (Both terms are often used interchange- 
ably.) Implied subproblems of parameter estimation include the definition of the criteria for best or 
adequate, and the characterization of potential errors in the estimates. 

Example 1.4-2 Reconsider the problem of Example (1.2-5). Although there is 
no linear model exactly consistent with the data, modeling the output as a 
constant value of 1 appears a reasonable approximation and agrees exactly with 
two of the three data points. 

One approach to parameter estimation is to minimize the error between the model response and the actual 
measured response, using a least squares or some similar ad hoc criterion. The values of the parameter 
vector $\xi$ which result in the minimum error are called the best estimates. Gauss (1809, p. 162) introduced 
this idea: 

Finally, as all our observations, on account of the imperfection of the 
instruments and of the senses, are only approximations to the truth, an 
orbit based only on the six absolutely necessary data may still be liable to 
considerable errors. In order to diminish these as much as possible, and 
thus to reach the greatest precision attainable, no other method will be 
given except to accumulate the greatest number of the most perfect observations, 
and to adjust the elements, not so as to satisfy this or that set of 
observations with absolute exactness, but so as to agree with all in the 
best possible manner. 

This approach is easy to understand without extensive mathematical background, and it can produce excellent 
results. It is restricted to deterministic models so that the model response can be calculated. 
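
As an illustration of the error-minimization approach, the Python sketch below applies a least squares criterion to the three measurements of Example (1.2-5), using the linear model Z = a0 + a1 U. It is a minimal sketch: the function names are hypothetical, and the closed-form expressions are the standard least squares formulas for fitting a straight line.

```python
# Data from Example (1.2-5): inputs U and measured responses Z.
U = [-1.0, 1.0, 0.0]
Z = [1.0, 1.0, 0.95]


def fit_line(U, Z):
    """Least squares estimates (a0, a1) for the model Z = a0 + a1*U."""
    n = len(U)
    u_mean = sum(U) / n
    z_mean = sum(Z) / n
    s_uu = sum((u - u_mean) ** 2 for u in U)
    s_uz = sum((u - u_mean) * (z - z_mean) for u, z in zip(U, Z))
    a1 = s_uz / s_uu
    a0 = z_mean - a1 * u_mean
    return a0, a1


def sum_squared_error(a0, a1, U, Z):
    """Sum of squared differences between measured and model responses."""
    return sum((z - (a0 + a1 * u)) ** 2 for u, z in zip(U, Z))


a0, a1 = fit_line(U, Z)
cost_ls = sum_squared_error(a0, a1, U, Z)

# The constant model Z = 1, which matches two of the three points exactly.
cost_const = sum_squared_error(1.0, 0.0, U, Z)

print(a0, a1, cost_ls, cost_const)
```

The least squares estimate is a constant slightly below 1, with a smaller summed squared error than the constant model Z = 1. Which of the two is "best" depends on the criterion adopted.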

An alternate approach to parameter estimation introduces probabilistic concepts in order to take advan- 
tage of the extensive theory of statistical estimation. We should note that, from Gauss's time, these two 
approaches have been intimately linked. The sentence immediately following the above exposition in Theoria 
Motus (Gauss, 1809, p. 162) is 

For which purpose, we will show in the third section how, according to the 
principles of the calculus of probabilities, such an agreement may be 
obtained, as will be, if in no one place perfect, yet in all places the 
strictest possible. 

In the statistical approach, all of the effects not included in the deterministic system model are modeled as 
random noise; the characteristics of the noise and its position in the system equations vary for different 
applications. The probabilistic treatment solves the perplexing problem of how to examine the effect of the 
unmodeled portion of the system without first modeling it. The formerly unmodeled portion is modeled probabilistically, 
which allows description of its general characteristics such as magnitude and frequency content, 
without requiring a detailed model. Systems such as this, which involve both time and randomness, are referred 
to as stochastic systems. This document will examine a small part of the extensive theory of stochastic sys- 
tems, which can be used to define estimates of the unknown parameters and to characterize the properties of 
these estimates. 


Although this document will devote significant time to the treatment of the probabilistic approach, this 
approach should not be oversold. It is currently popular to disparage model-fitting approaches as nonrigorous 
and without theoretical basis. Such attitudes ignore two important facts: first, in many of the most common 
situations, the "sophisticated" probabilistic approach arrives at the same estimation algorithm as the model-fitting 
approaches. This fact is often obscured by the use of buzz words and unenlightening notation, apparently 
for fear that the theoretical effort will be considered as wasted. Our view is that such relationships 
should be emphasized and clearly explained. The two approaches complement each other, and the engineer who 
understands both is best equipped to handle real world problems. The model-fitting approach gives good intuitive 
understanding of such problems as modeling error, algorithm convergence, and identifiability, among 
others. The probabilistic approach contributes quantitative characterization of the properties of the estimates 
(the accuracy), and an understanding of how these characteristics are affected by various factors. 

The second fact ignored by those who disparage model fitting is that the probabilistic approach involves 
just as many (or more) unjustified ad hoc assumptions. Behind the smug front of mathematical rigor and sophis- 
tication lie patently ridiculous assumptions about the system. The contaminating noise seldom has any of the 
characteristics (Gaussian, white, etc.) assumed simply in order to get results in a usable form. More basic is 
the fact that the contaminating noise is not necessarily random noise at all. It is a composite of all of the 
otherwise unmodeled portions of the system output, some of which might be "truly" random (deferring the 
philosophical question of whether truly random events exist), but some of which are certainly deterministic 
even at the macroscopic level. In light of this consideration, the "rigor" of the probabilistic approach is 
tarnished from the start, no matter how precise the inner mathematics. Contrary to the impressions often 
given, the probabilistic approach is not the single correct answer, but is one of the possible avenues that can 
give useful results, making on the average as many unjustified or blatantly false assumptions as the alterna- 
tives. Bayes (1736, p. 9), in an essay reprinted by Barnard (1958), made a classical statement on the role of 
assumptions in mathematics: 

It is not the business of the Mathematician to dispute whether quantities do 
in fact ever vary in the manner that is supposed, but only whether the notion 
of their doing so be intelligible; which being allowed, he has a right to take 
it for granted, and then see what deductions he can make from that supposition... 
He is not inquiring how things are in matter of fact, but supposing 
things to be in a certain way, what are the consequences to be deduced from 
them; and all that is to be demanded of him is, that his suppositions be 
intelligible, and his inferences just from the suppositions he makes. 

The demands on the applications engineer are somewhat different, and more in line with Bayes' (1736, p. 50) 
later statement in the same document. 

So far as Mathematics do not tend to make men more sober and rational thinkers, 
wiser and better men, they are only to be considered as an amusement, which 
ought not to take us off from serious business. 

A few words are necessary in defense of the probabilistic approach, lest the reader decide that it is not 
worthwhile to pursue. The main issue is the description of deterministic phenomena as random. This disagrees 
with common modern perceptions of the meaning and use of randomness for physical situations, in which random 
and deterministic phenomena are considered as quite distinct and well delineated. Our viewpoint owes more to 
the earlier philosophy of probability theory: that it is a useful tool for studying complicated phenomena 
which need not be inherently random (if anything is inherently random). Cramer (1946, p. 141) gives a classic 
exposition of this philosophy: 

[The following is descriptive of]... large and important groups of random 
experiments. Small variations in the initial state of the observed units, 
which cannot be detected by our instruments, may produce considerable changes 
in the final result. The complicated character of the laws of the observed 
phenomena may render exact calculation practically, if not theoretically, 
impossible. Uncontrollable action by small disturbing factors may lead to 
irregular deviations from a presumed "true value". 

It is, of course, clear that there is no sharp distinction between these 
various modes of randomness. Whether we ascribe e.g. the fluctuations observed 
in the results of a series of shots at a target mainly to small variations in 
the initial state of the projectile, to the complicated nature of the ballistic 
laws, or to the action of small disturbing factors, is largely a matter of 
taste. The essential thing is that, in all cases where one or more of these 
circumstances are present, an exact prediction of the results of individual 
experiments becomes impossible, and the irregular fluctuations characteristic 
of random experiments will appear. 

We shall now see that, in cases of this character, there appears amidst 
all irregularity of fluctuations a certain typical form of regularity that 
will serve as the basis of the mathematical theory of statistics. 

The probabilistic methods allow quantitative analysis of the general behavior of these complicated phenomena, 
even though we are unable to model the exact behavior. 




1.5 OTHER APPROACHES 




Our aim in this document is to present a unified viewpoint of the system identification ideas leading 
to maximum-likelihood estimation of the parameters of dynamic systems, and of the application of these ideas. 
There are many completely different approaches to identification of dynamic systems. 

There are innumerable books and papers in the system identification literature. Eykhoff (1974) and 
Astrom and Eykhoff (1970) give surveys of the field. However, much of the work in system identification is 
published outside of the general body of system identification literature. Many techniques have been devel- 
oped for specific areas of application by researchers oriented more toward the application area than toward 
general system identification problems. These specialized techniques are part of the larger field of system 
identification, although they are usually not labeled as such. (Sometimes they are recognizable as special 
cases or applications of more general results.) In the area most familiar to us, aircraft stability and con- 
trol derivatives were estimated from flight data long before such estimation was classified as a system 
identification problem (Doetsch, 1953; Etkin, 1958; Flack, 1959; Greenberg, 1951; Rampy and Berry, 1964; 
Wolowicz, 1966; and Wolowicz and Holleman, 1958). 

We do not even attempt here the monumental task of surveying the large body of system identification 
techniques. Suffice it to say that other approaches exist, some explicitly labeled as system identification 
techniques, and some not so labeled. We feel that we are better equipped to make a useful contribution by 
presenting, in an organized and comprehensible manner, the viewpoint with which we are most familiar. This 
orientation does not constitute a dismissal of other viewpoints. 

We have sometimes been asked to refute claims that, in some specific application, a simple technique such 
as regression obtained superior results to a "sophisticated" technique bearing impressive-sounding credentials 
as an optimal nonlinear maximum likelihood estimator. The implication is that simple is somehow synonymous 
with poor, and sophisticated is synonymous with good, associations that we completely disavow. Indeed, the 
opposite association seems more often appropriate, and we try to present the maximum likelihood estimator in 
a simple light. We believe that these methods are all tools to be used when they help do the job. We have 
used quotations from Gauss several times in this chapter to illustrate his insight into what are still some of 
the important issues of the day, and we will close the chapter with yet another (Gauss, 1809, p. 108): 

...we hope, therefore, it will not be disagreeable to the reader, that, besides 
the solution to be given hereafter, which seems to leave nothing further to be 
desired, we have thought proper to preserve also the one of which we have made 
frequent use before the former suggested itself to me. It is always profitable 
to approach the more difficult problems in several ways, and not to despise the 
good although preferring the better. 


2.0 OPTIMIZATION METHODS 


Most of the estimators in this book require the minimization or maximization of a nonlinear function. 
Sometimes we can write an explicit expression for the minimum or maximum point. In many cases, however, we 
must use an iterative numerical algorithm to find the solution. Therefore a background in optimization methods 
is mandatory for appreciation of the various estimators. 

Optimization is a major field in its own right and we do not attempt a thorough treatment or even a survey 
of the field in this chapter. Our purpose is to briefly introduce a few of the optimization techniques most 
pertinent to parameter estimation. Several of the conclusions we draw about the relative merits of various 
algorithms are influenced by the general structure of parameter estimation problems and, thus, might not be 
supportable in a broader context of optimizing arbitrary functions. Numerous books such as Rao (1979), 
Luenberger (1969), Luenberger (1972), Dixon (1972), and Polak (1971) cover the detailed derivation and analysis 
of the techniques discussed here and others. These books give more thorough treatments of the optimization 
methods than we have room for here, but are not oriented specifically to parameter estimation problems. For 
those involved in the application of estimation theory, and particularly for those who will be writing computer 
programs for parameter estimation, we strongly recommend reading several of these books. The utility and efficiency 
of a parameter estimation program depend strongly on its optimization algorithms. The material in this 
chapter should be sufficient for a general understanding of the problems and the kinds of algorithms used, but 
not for the details of efficient application. 

The basic optimization problem is to find the value of the vector x that gives the smallest or largest 
value of the scalar-valued function J(x). By convention we will talk about minimization problems; any maxi- 
mization problem can be made into an equivalent minimization problem by changing the sign of the function. We 
will follow the widespread practice of calling the function to be minimized a cost function, regardless of 
whether or not it really has anything to do with monetary cost. To formalize the definition of the problem, 
a function J(x) is said to have a minimum at $\hat{x}$ if 

    $J(\hat{x}) \le J(x)$    (2.0-1) 

for all x. This is sometimes called an unconstrained global minimum to distinguish it from local and constrained 
minima, which are defined below. 

Two kinds of side constraints are sometimes placed on the problem. Equality constraints are in the form 

    $g_i(x) = 0$    (2.0-2) 

Inequality constraints are in the form 

    $h_i(x) \le 0$    (2.0-3) 

The $g_i$ and $h_i$ are scalar-valued functions of x. There can be any number of constraints on a problem. A 
value of x is called admissible if it satisfies all of the constraints; if a value violates any of the constraints 
it is inadmissible. The constraints modify the problem statement as follows: $\hat{x}$ is the constrained 
minimum of J(x) if $\hat{x}$ is admissible and if Equation (2.0-1) holds for all admissible x. 

Two crucial questions about any optimization problem are whether a solution exists and whether it is 
unique. These questions are important in application as well as in theory. A computer program can spend a 
long time searching for a solution that does not exist. A simple example of an optimization problem with no 
solution is the unconstrained minimization of J(x) = x. A problem can also fail to have a solution because 
there is no x satisfying the constraints. We will say that a problem that has no solution is ill-posed. 

A simple problem with a nonunique solution is the unconstrained minimization of $J(x) = (x_1 - x_2)^2$, where x 
is a 2-vector. 


All of the algorithms that we discuss (and most other algorithms) search for a local minimum of the func- 
tion, rather than the global minimum. A local minimum (also called a relative minimum) is defined as follows: 
$\hat{x}$ is a local minimum of J(x) if a scalar $\epsilon > 0$ exists such that 

    $J(\hat{x}) \le J(\hat{x} + h)$    (2.0-4) 

for all h with $|h| < \epsilon$. To define a constrained local minimum, we must add the qualifications that $\hat{x}$ 
and $\hat{x} + h$ satisfy the constraints. The term "extremum" refers to either a local minimum or a local maximum. 
Figure (2.0-1) illustrates a problem with three local minima, one of which is the global minimum. 

Note that if a global minimum exists, even if it is not unique, it is also a local minimum. The converse 
to this statement is false; the existence of a local minimum does not even imply that a global minimum exists. 
We can sometimes prove that a function has only one local minimum point, and that this point is also the 
global minimum. When we lack such proofs, there is no universal way to guarantee that the local minimum found 
by an algorithm is the global minimum. A reasonable check for iterative algorithms is to try the algorithm 
with many different starting values widely distributed within the realm of possible values. If the algorithm 
consistently converges to the same point from all of these starting values, that point is probably the global 
minimum. The cost of such 
a test, however, is often prohibitively high. 
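
The multiple-start check can be sketched in a few lines of Python. Everything in this sketch is an assumption made for illustration: the cost function J is a hypothetical scalar function with two local minima, and the search is a plain fixed-step gradient descent rather than any of the algorithms discussed later in this chapter.

```python
def J(x):
    """Hypothetical scalar cost function with two local minima."""
    return x ** 4 - 3.0 * x ** 2 + x


def dJ(x):
    """Derivative of the hypothetical cost function."""
    return 4.0 * x ** 3 - 6.0 * x + 1.0


def descend(x, step=0.01, iterations=5000):
    """Fixed-step gradient descent; converges to a local minimum only."""
    for _ in range(iterations):
        x -= step * dJ(x)
    return x


# Starting values spread over the range of plausible answers.
starts = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
results = [descend(x0) for x0 in starts]

# Distinct answers found, rounded so numerically equal results merge.
minima = sorted({round(x, 4) for x in results})

# Lowest-cost answer among all starts.
best = min(results, key=J)

print(minima, best, J(best))
```

Here the starting values do not all lead to the same answer, which reveals the presence of more than one local minimum. Had every start converged to the same point, that point would probably, but not certainly, be the global minimum.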

The likelihood of local minima difficulties varies widely depending on the application. In some applications 
we can prove that there are no local minima except at the unique global minimum. At the other extreme, 
some applications are plagued by numerous local minima to the extent that most minimization algorithms are 
worthless. Most applications lie between these extremes. We can often argue convincingly that a particular 
answer must be the global minimum, even when rigorous proof is impractical. 

The algorithms in this chapter are, with a few exceptions, iterative. Given some starting value $x_0$, the 
algorithms give a procedure for computing a new value $x_1$; then $x_2$ is computed from $x_1$, etc. The intent of 
the iterative algorithms is to create a sequence $x_i$ that converges to the minimum. The starting value can 
be from an independent estimate of a reasonable answer, or it can come from a special start-up algorithm. The 
final step of any iterative algorithm is testing convergence. After the algorithm has proceeded for some time, 
we need to choose among the following alternatives: 1) the algorithm has converged to a value sufficiently 
close to the true minimum and should therefore be terminated; 2) the algorithm is making acceptable progress 
toward the solution and should be continued; 3) the algorithm is failing to converge or is converging too 
slowly to obtain a solution in an acceptable time, and it should therefore be abandoned; or 4) the algorithm 
is exhibiting behavior that suggests that switching to a different algorithm (or modifying the current one) 
might be productive. This decision is far from trivial because some algorithms can essentially stall at a 
point far from any local minimum, making such small changes in $x_i$ that they appear to have converged. 

We have briefly mentioned the problems of existence and uniqueness of solutions, local minima, starting 
values, and convergence tests. These are major issues in practical application, but we will not examine them 
further here. The references contain considerable discussion of these issues. 

A cost function of an N-dimensional x vector can be visualized as a hypersurface in (N + 1)-dimensional
space. For illustrating the behavior of the various algorithms, we will use isocline plots of cost functions 
of two variables. An isocline is the locus of all points in the x-space corresponding to some specified cost 
function value. The isoclines of positive definite quadratic functions are always ellipses. Furthermore, a 
quadratic function is completely specified by one of its isoclines and the fact that it is quadratic.
Two-dimensional examples are sufficient to illustrate most of the pertinent points of the algorithms.

We will consider unconstrained minimization problems, which illustrate the basic points necessary for our
purposes. The references address problems with equality and inequality constraints. 


2.1 ONE-DIMENSIONAL SEARCHES 

Optimization methodology is strongly influenced by whether or not x is a scalar. Because the
optimization problems in this book are generally multi-dimensional, the methods applicable only to scalar x are not
directly relevant.

Many of the multi-dimensional optimization algorithms, however, require the solution of one-dimensional
subproblems as part of the larger algorithm. Most such subproblems are in the form of minimizing the
multi-dimensional cost function with x constrained to a line in the multi-dimensional space. This has the
superficial appearance of a multi-dimensional problem, and furthermore one with the added complications of
constraints. To clarify the one-dimensional nature of these subproblems, express them as follows: the vector x
is restricted to a line defined by

$$x = x_0 + \lambda x_1 \tag{2.1-1}$$

where $x_0$ and $x_1$ are fixed vectors, and $\lambda$ is a scalar variable representing position along the line.
Restricted to this line, the cost can be written as a function of $\lambda$.

$$g(\lambda) = J(x_0 + \lambda x_1) \tag{2.1-2}$$

The function $g(\lambda)$ is a scalar function of a scalar variable, and one-dimensional minimization algorithms apply
directly. Substituting the minimizing value of $\lambda$ into Equation (2.1-1) then gives the minimizing point along
the line in the space of x.
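
A minimal sketch of this reduction follows. The cost function `J`, the vectors `x0` and `x1`, and the choice of golden-section search as the one-dimensional minimizer are all assumptions made for illustration; the text does not prescribe a particular one-dimensional algorithm.

```python
import math

def restrict_to_line(J, x0, x1):
    """Return g(lam) = J(x0 + lam * x1), the cost along the line of Equation (2.1-1)."""
    def g(lam):
        return J([a + lam * b for a, b in zip(x0, x1)])
    return g

def golden_section(g, lo, hi, tol=1e-10):
    """Minimize a unimodal scalar function on [lo, hi] by golden-section search."""
    r = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - r * (b - a), a + r * (b - a)
    while b - a > tol:
        if g(c) < g(d):          # minimum lies in [a, d]
            b, d = d, c
            c = b - r * (b - a)
        else:                    # minimum lies in [c, b]
            a, c = c, d
            d = a + r * (b - a)
    return 0.5 * (a + b)

# Hypothetical two-dimensional cost and search line.
J = lambda x: (x[0] - 1.0) ** 2 + 4.0 * (x[1] + 2.0) ** 2
x0, x1 = [0.0, 0.0], [1.0, -1.0]
g = restrict_to_line(J, x0, x1)
lam = golden_section(g, -10.0, 10.0)
x_min = [a + lam * b for a, b in zip(x0, x1)]   # substitute back into (2.1-1)
```

For this particular cost, $g(\lambda) = (\lambda - 1)^2 + 4(\lambda - 2)^2$, so setting the derivative to zero gives $\lambda = 1.8$, which the search reproduces to within its tolerance.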

We will not examine the one-dimensional search algorithms in this book. Several of the references have 
good treatments of the subject. We will note that most of the relevant one-dimensional algorithms involve 
approximating the function by a low-order polynomial based on the values of the function and its first and 
second derivatives at one or more points. The minimum point of the polynomial, explicitly evaluated, replaces
one of the original points, and the process repeats. The distinguishing features of the algorithms are the 
order of the polynomial, the number of points, and the order of the derivatives of J(x) evaluated. Variants 
of the algorithms depend on start-up procedures and methods for selecting the point to be replaced. 

In some special cases we can solve the one-dimensional minimization problems explicitly by setting the 
derivative to zero, or by other means, even when we cannot explicitly solve the encompassing multi-dimensional 
problem. Several of our examples of multi-dimensional algorithms will use explicit solutions of the one- 
dimensional subproblems to avoid getting bogged down in detail. Real problems seldom will be so conveniently
amenable to exact solution of the one-dimensional subproblems, except where the multi-dimensional problem could
be directly solved without resort to iterative methods. Iterative one-dimensional searches are usually
necessary with any method that involves one-dimensional subproblems. We will encounter one of the rare exceptions
in the estimation of variance.


2.2 DIRECT METHODS 

Optimization methods that do not require the evaluation of derivatives of the cost function are called 
direct methods or zero-order methods (because they use up to zeroth-order derivatives). These methods use
only the cost function values. 

Axial iteration, also called the univariate method or coordinate descent, is the basis for many of the
direct methods. In this method we search along each of the coordinate directions of the x-space, one at a
time. Starting with the point $x_0$, fix the values of all but the first coordinate, reducing the problem to
one-dimensional minimization. Solve this problem using any one-dimensional algorithm. Call the resulting
point $x_1$. Then fix the first coordinate at the value so determined and do a similar search along the
direction of the second coordinate, giving the point $x_2$. Continue these one-dimensional searches until each of the
N coordinate directions has been searched; the final point of this process is $x_N$.

The point $x_N$ completes the first cycle of minimization. Repeat this cycle starting from the point $x_N$
instead of $x_0$. Continue repeating the minimization cycle until the process converges (or until you give up,
which may well come first).

The performance of the axial iteration algorithm on most problems is unacceptably poor. The algorithm 
performs well only when the minimum point along each axis is nearly independent of the values of the other 
coordinates. 


Example 2.2-1 Use axial iteration to minimize $J(x,y) = A(x - y)^2 + B(x + y)^2$
with $A \gg B$. The solution is the trivially obvious (0,0), but the problem
is good for illustrating the behavior of algorithms in a simple case. Instead
of using a one-dimensional search procedure, we will explicitly solve the
one-dimensional subproblems. For any fixed y, obtain the minimizing x coordinate
value by setting the derivative to zero

$$\frac{\partial}{\partial x} J(x,y) = 2A(x - y) + 2B(x + y) = 0$$

giving

$$x = \frac{A - B}{A + B}\, y$$

Similarly, for fixed x, the minimizing y value is

$$y = \frac{A - B}{A + B}\, x$$

We see that for $A \gg B$, the values of x and y descend slowly toward the true minimum at (0,0).
Figure (2.2-1) illustrates this behavior on an isocline plot. Note that if $A = B$ (the cost function isocline
is circular) the exact minimum is obtained in one cycle, but as A/B increases the performance worsens.
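
The behavior in this example can be reproduced with a short sketch that applies the two explicit one-dimensional solutions in turn. The function name, the starting point (1, 1), and the values A = 100, B = 1 are choices made here for illustration only.

```python
def axial_iteration(A, B, x, y, cycles):
    """Axial iteration on J = A(x-y)^2 + B(x+y)^2 using the explicit 1-D solutions."""
    r = (A - B) / (A + B)
    history = [(x, y)]
    for _ in range(cycles):
        x = r * y      # minimize over x with y held fixed
        y = r * x      # minimize over y with x held fixed
        history.append((x, y))
    return history

slow = axial_iteration(A=100.0, B=1.0, x=1.0, y=1.0, cycles=50)   # A >> B
fast = axial_iteration(A=1.0, B=1.0, x=1.0, y=1.0, cycles=1)      # circular isoclines
```

Each cycle multiplies y by $[(A - B)/(A + B)]^2$. With A/B = 100 this factor is about 0.96, so 50 cycles reduce y only to roughly 0.135, whereas with A = B a single cycle lands exactly on (0,0).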

Several modifications to the basic axial iteration method are available to improve its performance. Some
of these modifications exploit the notion of the pattern direction, the direction from the beginning point
$x_{iN}$ of a cycle to the end point $x_{(i+1)N}$ of the same cycle. Figure (2.2-2) illustrates the pattern
direction, which tends to point in the general direction of the minimum. Powell's method is the most powerful of
the direct methods that search along pattern directions. See the references for details.


2.3 GRADIENT METHODS 

Optimization methods that use the first derivative (gradient) of the cost function are called gradient
methods or first order methods. Gradient methods require that the cost function be differentiable; most of the
cost functions considered in this book meet this requirement. The gradient methods generally converge in fewer
iterations than many of the direct methods because the gradient methods use more information in each iteration.
(There are exceptions, particularly when comparing simple-minded gradient methods with the most powerful of the
direct methods.) The penalty paid for the generally improved performance of the gradient methods compared with
the direct methods is the requirement to evaluate the gradient.

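We define the gradient of the function J(x) with respect to x to be the row vector shown in Equation (2.3-1).
(Some texts define it as a column vector; the difference is inconsequential as long as one is consistent.)

$$\nabla_x J(x) = \left[ \frac{\partial}{\partial x_1} \;\; \frac{\partial}{\partial x_2} \;\; \cdots \;\; \frac{\partial}{\partial x_N} \right] J(x) \tag{2.3-1}$$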


A reasonable estimate of the computational cost of evaluating the gradient is N times the cost of evaluating
the function. This estimate follows from the fact that the gradient can be approximately evaluated by N
finite differences

$$\frac{\partial J(x)}{\partial x_i} \approx \frac{J(x + \epsilon e_i) - J(x)}{\epsilon} \tag{2.3-2}$$

where $e_i$ is the unit vector along the $x_i$ axis and $\epsilon$ is a small number. In special cases, there can be
expressions for the gradient that cost significantly less than N function evaluations.
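
Equation (2.3-2) translates almost directly into code. The sketch below is illustrative only: the function name `fd_gradient`, the default step `eps`, and the test cost function are assumptions, and a practical implementation would choose the step size relative to the scale of each coordinate.

```python
def fd_gradient(J, x, eps=1e-6):
    """Forward-difference approximation of the gradient, Equation (2.3-2)."""
    J0 = J(x)                     # one evaluation at the base point
    grad = []
    for i in range(len(x)):       # plus one evaluation per coordinate
        xp = list(x)
        xp[i] += eps              # x + eps * e_i
        grad.append((J(xp) - J0) / eps)
    return grad

# Cost function of the form used in the scaling example later in this section.
J = lambda x: 0.5 * (0.01 * x[0] ** 2 + x[1] ** 2)
g = fd_gradient(J, [10.0, 1.0])   # exact gradient is (0.1, 1.0)
```

The loop makes N evaluations beyond the one at the base point, which is the source of the cost estimate quoted above.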


Equation (2.3-2) somewhat obscures the distinction between the gradient methods and the direct methods. 
We can rewrite any gradient method in a finite difference form that does not explicitly involve gradients. 
There is, nonetheless, a fairly clear distinction between methods derived from gradient ideas and methods 
derived from direct search ideas. We will retain this philosophical distinction regardless of whether the 
gradients are evaluated explicitly or by finite differences. 

The method of steepest descent (also called the gradient method) involves a series of one-dimensional
searches, as did the axial-iteration method and its variants. In the steepest-descent method, these searches
are along the direction of the negative of the gradient vector, evaluated at the current point. The
one-dimensional problem is to find the value of $\lambda$ that minimizes

$$g_i(\lambda) = J(x_i + \lambda s_i) \tag{2.3-3}$$

where $s_i$ is the search direction given by

$$s_i = -\nabla_x^* J(x)\big|_{x = x_i} \tag{2.3-4}$$

The negative of the gradient is the direction of steepest local descent of the cost function (thus the
name of the method). To prove this property, first note that for any vector s we have

$$\frac{d}{d\lambda} J(x + \lambda s)\Big|_{\lambda = 0} = \langle s, \nabla_x^* J(x) \rangle \tag{2.3-5}$$

We are using the $\langle \cdot , \cdot \rangle$ notation for the inner product

$$\langle x, y \rangle = x^* y \tag{2.3-6}$$

Equation (2.3-5) is a generalization of the definition of the gradient; it applies in spaces where
Equation (2.3-1) is not meaningful. We then need only show that, if s is restricted to be a unit vector,
Equation (2.3-5) is minimized by choosing s in the direction of $-\nabla_x^* J(x)$. This follows immediately from the
Cauchy-Schwarz inequality (Luenberger, 1969) of linear algebra.

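Theorem 2.3-1 (Cauchy-Schwarz) $\langle x, y \rangle^2 \le |x|^2 |y|^2$ with equality if and only if $x = ay$ for
some scalar a.

Proof The theorem is trivial if $y = 0$. For $y \ne 0$ examine

$$\langle x + \lambda y, x + \lambda y \rangle = \langle x, x \rangle + \lambda^2 \langle y, y \rangle + 2\lambda \langle x, y \rangle \ge 0 \tag{2.3-7}$$

Choose

$$\lambda = -\langle x, y \rangle / \langle y, y \rangle \tag{2.3-8}$$

Substitute into Equation (2.3-7) and rearrange to give

$$\langle x, y \rangle^2 \le \langle x, x \rangle \langle y, y \rangle = |x|^2 |y|^2 \tag{2.3-9}$$

Equality holds if and only if $x + \lambda y = 0$ in Equation (2.3-7), which will be
true if and only if $x = ay$ ($\lambda$ will then be $-a$).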

On the surface, the steepest descent property of the method seems to imply excellent performance in
minimizing the cost function value. The direction of steepest descent, however, is a local property which might
point far from the direction of the global minimum. It is thus often a poor choice of search direction.
Direct methods such as Powell's often converge more rapidly than steepest descent.

The steepest descent method performs worst in long narrow valleys of the cost function. It is also
sensitive to scaling. These two difficulties are closely related; rescaling a problem can easily create long
narrow valleys. The following examples illustrate the scaling and valley difficulties:

Example 2.3-1 Let the cost function be

$$J(x) = \tfrac{1}{2}\left(x_1^2 + x_2^2\right)$$

The steepest descent method works excellently for this cost function (so does
almost every optimization method). The gradient of J(x) is

$$\nabla_x J(x) = (x_1, x_2) = x^*$$

Therefore, from any starting point, the negative of the gradient points 
exactly at the origin, which is the global minimum. The minimum will be 
attained exactly (or to the accuracy of the one-dimensional search methods 
used) in one iteration. Figure (2.3-1) illustrates the algorithm starting 
from the point (1,1)*. 

Example 2.3-2 Rescale the preceding example by replacing $x_1$ by $0.1 x_1$.
(Perhaps we just redefined the units of $x_1$ to be millimeters instead of
centimeters.) The cost function is then

$$J(x) = \tfrac{1}{2}\left(0.01 x_1^2 + x_2^2\right)$$

and the gradient is

$$\nabla_x J(x) = (0.01 x_1, x_2)$$

Figure (2.3-2) shows the search direction used by the algorithm starting from
the point $(10,1)^*$, which corresponds to the point $(1,1)^*$ in the previous
example. The search direction points almost 90° away from the direction of the origin. A careless
glance at Figure (2.3-2) invites the conclusion that the minimum in the
search direction will be on the $x_1$ axis and thus that the second iteration
of the steepest descent algorithm will attain the minimum. It is true that
the minimum is close to the $x_1$ axis, but it is not exactly on the axis; the
distinction makes an important difference in the algorithm's performance.

For points $x - \lambda \nabla_x^* J(x)$ along the search direction from any point
$(x_1, x_2)^*$, the cost function is

$$g(\lambda) = J\left(x - \lambda \nabla_x^* J(x)\right) = \tfrac{1}{2}\left[0.01 x_1^2 (1 - 0.01\lambda)^2 + x_2^2 (1 - \lambda)^2\right]$$

The minimum of $g(\lambda)$ is at

$$\lambda = \frac{(0.01)^2 x_1^2 + x_2^2}{(0.01)^3 x_1^2 + x_2^2}$$

and thus the minimum point along the search direction is

$$\left(x_1 - 0.01 x_1 \lambda,\; x_2 - x_2 \lambda\right)^*$$

with $\lambda$ defined as above. The following table and Figure (2.3-3) show
several iterations of this process starting from the point $(10,1)^*$.


Iteration      $x_1$       $x_2$

    0         10           1
    1          9.899      -0.009899
    2          4.900       0.4900
    3          4.851      -0.004851
    4          2.401       0.2401
    5          2.377      -0.002377
    6          1.176       0.1176
    7          1.165      -0.001165


The trend of the algorithm is clear; every two iterations it moves essentially
halfway to the solution. Consider the behavior starting from the point
$(10,0.1)^*$ instead of $(10,1)^*$:


Iteration      $x_1$       $x_2$

    0         10           0.1
    1          9.802      -0.09802
    2          9.608       0.09608
    3          9.418      -0.09418
    4          9.231       0.09231
    5          9.048      -0.09048
    6          8.869       0.08869
    7          8.694      -0.08694


This behavior, plotted in Figure (2.3-4), is abysmal. The algorithm is
bouncing back and forth across the valley, making little progress toward the
minimum.
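
The two tables can be regenerated with a few lines of code that apply the explicit line-search solution of this example repeatedly. The function name and the list-of-tuples return format are choices made here for illustration.

```python
def steepest_descent(x1, x2, iterations):
    """Steepest descent with exact line search on J = 0.5*(0.01*x1^2 + x2^2)."""
    path = [(x1, x2)]
    for _ in range(iterations):
        # Minimizing step length along the negative gradient (0.01*x1, x2).
        lam = (0.01 ** 2 * x1 ** 2 + x2 ** 2) / (0.01 ** 3 * x1 ** 2 + x2 ** 2)
        x1, x2 = x1 - 0.01 * x1 * lam, x2 - x2 * lam
        path.append((x1, x2))
    return path

table_a = steepest_descent(10.0, 1.0, 7)    # first table
table_b = steepest_descent(10.0, 0.1, 7)    # second table
```

From $(10,1)^*$ the $x_1$ coordinate shrinks by a factor of about 0.49 every two iterations; from $(10,0.1)^*$ it shrinks by only about 2 percent per iteration, which is the valley-bouncing behavior described above.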

Several modifications to the steepest descent method are available to improve its performance. A rescaling
step to eliminate valleys caused by scaling yields major improvements for some problems. The method of
parallel tangents (PARTAN method) exploits pattern directions similar to those discussed in Section 2.2; searches in
such pattern directions are often called acceleration steps. The conjugate gradient method is the most
powerful of the modifications to steepest descent. The references discuss these and other gradient algorithms in
detail.


2.4 SECOND ORDER METHODS

Optimization methods that use the second derivative (or an approximation to it) of the cost function are
called second order methods. These methods require that the first and second derivatives of the cost function
exist.

2.4.1 Newton-Raphson

The Newton-Raphson optimization algorithm (also called Newton's method) is the basis for all of the second
order methods. The idea of this algorithm is to approximate the cost function by the first three terms of its
Taylor series expansion about the current point.

$$J_i(x) = J(x_i) + (x - x_i)^* \nabla_x^* J(x_i) + \tfrac{1}{2}(x - x_i)^* \left[\nabla_x^2 J(x_i)\right](x - x_i) \tag{2.4-1}$$

From a geometric viewpoint, this equation describes the paraboloid that best approximates the function near
$x_i$. Equating the gradient of $J_i(x)$ to zero gives an equation for the minimum point of the approximating
function. Taking this gradient, note that $\nabla_x J(x_i)$ and $\nabla_x^2 J(x_i)$ are evaluated at the fixed point $x_i$ and
thus are not functions of x.

$$\nabla_x^* J_i(x) = \nabla_x^* J(x_i) + \left[\nabla_x^2 J(x_i)\right](x - x_i) = 0 \tag{2.4-2}$$

The solution is

$$x = x_i - \left[\nabla_x^2 J(x_i)\right]^{-1} \nabla_x^* J(x_i) \tag{2.4-3}$$

If the second gradient of J is positive definite, then Equation (2.4-3) gives the exact unique minimum of
the approximating function; it is a reasonable guess at an approximate minimum of the original function. If
the second gradient is not positive definite, then the approximating function does not have a unique minimum
and the algorithm is likely to perform poorly. The Newton-Raphson algorithm uses Equation (2.4-3) iteratively;
the x from this equation is the starting point for the next iteration. The algorithm is

$$x_{i+1} = x_i - \left[\nabla_x^2 J(x_i)\right]^{-1} \nabla_x^* J(x_i) \tag{2.4-4}$$

The performance of this algorithm in the close neighborhood of a strict local minimum is unexcelled; this
performance represents an ideal toward which other algorithms strive. The Newton-Raphson algorithm attains
the exact (except for numerical round-off errors) minimum of any positive-definite quadratic function in a
single iteration. Convergence within 5 to 10 iterations is common on some practical nonquadratic problems
with several dozen dimensions; direct and gradient methods typically count iterations in hundreds and thousands
for such problems and settle for less accurate answers. See the references for analysis of convergence
characteristics.
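
A minimal two-dimensional sketch of Equation (2.4-4) follows. The helper `solve2`, the function names, and the nonquadratic test function are assumptions introduced here for illustration; the gradients and second gradients are supplied analytically.

```python
def solve2(H, g):
    """Solve the 2-by-2 linear system H d = g by Cramer's rule."""
    (a, b), (c, d) = H
    det = a * d - b * c
    return [(g[0] * d - b * g[1]) / det, (a * g[1] - c * g[0]) / det]

def newton_raphson(grad, hess, x, iterations):
    """Iterate Equation (2.4-4): x <- x - [second gradient]^-1 * gradient."""
    for _ in range(iterations):
        step = solve2(hess(x), grad(x))
        x = [x[0] - step[0], x[1] - step[1]]
    return x

# Quadratic J = 0.5*(0.01*x1^2 + x2^2): a single iteration reaches the origin.
grad_q = lambda x: [0.01 * x[0], x[1]]
hess_q = lambda x: [[0.01, 0.0], [0.0, 1.0]]
x_quad = newton_raphson(grad_q, hess_q, [10.0, 1.0], 1)

# Hypothetical nonquadratic cost J = x1^4 + x1^2 + (x2 - 1)^2, minimum at (0, 1).
grad_n = lambda x: [4.0 * x[0] ** 3 + 2.0 * x[0], 2.0 * (x[1] - 1.0)]
hess_n = lambda x: [[12.0 * x[0] ** 2 + 2.0, 0.0], [0.0, 2.0]]
x_non = newton_raphson(grad_n, hess_n, [1.0, 3.0], 8)
```

The quadratic case is the badly scaled function on which steepest descent needed many iterations; here one step suffices, apart from round-off. On the nonquadratic function the error in the first coordinate is roughly cubed at each step once the iterate is near the minimum.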

Three negative features of the Newton-Raphson algorithm balance its excellent convergence near the
minimum. First is the behavior of the algorithm far from the minimum. If the initial estimate is far from the
minimum, the algorithm often converges erratically or even diverges. Such problems are often associated with
second gradient matrices that are not positive definite. Because of this problem, it is common to use special
start-up procedures to get within the area where Newton-Raphson performs well. One such procedure is to start
with a gradient method, switching to Newton-Raphson near the minimum. There are many other start-up
procedures, and they play a key role in successful applications of the Newton-Raphson algorithm.

The second negative feature of the Newton-Raphson method is the computational cost and complexity of
evaluating the second gradient matrix. The magnitude of this difficulty varies widely among applications. In some
special cases the second gradient is little harder to compute than the first gradient; Newton-Raphson, perhaps
with a start-up procedure, is a good choice for such applications. If, at the other extreme, you are reduced
to finite-difference computation of the second gradient, Davidon-Fletcher-Powell (Section 2.4.4) is probably
a more appropriate algorithm. In evaluating the computational burden of Newton-Raphson and other methods,
remember that Newton-Raphson requires no one-dimensional searches. Equation (2.4-4) constitutes the entire
algorithm. The one-dimensional searches required by most other algorithms can account for a majority of their
computational cost.

The third negative feature of the Newton-Raphson algorithm is the necessity to invert the second gradient
matrix (or at least to solve the set of linear equations involving the matrix). The computer time required
for the inversion is seldom an issue; this time is typically small compared to the time required to evaluate
the second gradient. Furthermore, the algorithm converges quickly enough that if one linear system solution
per iteration is a large fraction of the total cost, then the total cost must be low, even if the linear system
is on the order of 100-by-100. The crucial issue concerning the inversion of the second gradient is the
possibility that the matrix could be singular or ill-conditioned. We will discuss singularities in Section 2.4.3.

2.4.2 Invariance 


The Newton-Raphson algorithm has far less difficulty with long narrow valleys of the cost function than 
does the steepest-descent method. This difference is related to an invariance property of the Newton-Raphson 
algorithm. Invariance of minimization algorithms is a useful concept which many texts mention briefly, if at
all. We will therefore elaborate somewhat on the subject. 

The examples in the section on steepest descent illustrate a strong link between scaling and narrow
valleys. Scaling changes can easily create such valleys. Therefore we can generally state that minimization
methods that are sensitive to scaling changes are likely to behave poorly in narrow valleys.

This reasoning suggests a simple criterion for evaluating optimization algorithms: a good optimization
algorithm should be invariant under scaling changes. This principle is almost so self-evident as to be
unworthy of mention. The user of a program would be justifiably disgruntled if an algorithm that worked in
the English Gravitational System (Imperial System) of units failed when applied to the same problem expressed
in metric units (or vice versa). Someone trying to duplicate reported results would be perplexed by data
published in metric units which could be duplicated only by converting to English Gravitational System units,
in which the computation was really done. Nonetheless, many common algorithms, including the steepest descent
method, fail to exhibit invariance under scaling.

The criterion is neither necessary nor sufficient. It is easy to construct ridiculous algorithms that are
invariant to scale changes (such as the algorithm that always returns the value zero), and scale-sensitive
algorithms like the steepest descent method have achieved excellent results in some applications. It is safe to
state, however, that you can usually improve a good scale-sensitive algorithm by making it scale-invariant.

An initial step that rescales the problem can effectively make the steepest-descent method scale-invariant
(although such a step destroys a different invariance property of the steepest-descent method: invariance
under rotation of coordinates). Rescaling a problem can be done manually by the user, or it can be an
automatic part of an algorithm; automatic rescaling has the obvious advantage of being easier for the user, and a
secondary advantage of allowing dynamic scaling changes as the algorithm proceeds.

We can extend the idea of invariance beyond scale changes. In general, we would like an algorithm to be
invariant under the largest possible set of transformations. A justification for this criterion is that
almost any complicated minimization problem can be expressed as some transformation (possibly quite
complicated) of a simpler problem. We can sometimes use such transformations to simplify the solution of the
original problems. Often it is more difficult to do the transformation than to solve the original optimization
problem. Even if we cannot do the transformations, we can use the concept to conclude that an optimization
algorithm invariant over a large class of transformations is likely to work on a large class of problems.

The Newton-Raphson algorithm is invariant under all invertible linear transformations. This is the widest
invariance property that we can usually achieve.

The scale-invariance of the Newton-Raphson algorithm can be partially nullified by poor choice of matrix
inversion (or linear system solution) algorithms. We have assumed exact arithmetic in the preceding discussion
of scale-invariance. Some matrix inversion routines are sensitive to scaling effects. Inversion based on
Cholesky factorization (Wilkinson, 1965, and Acton, 1970) is a good, easily implemented method for symmetric
matrices (the second gradient is always symmetric), and is insensitive to scaling. Alternatively, pre-scale
the matrix by using its diagonal elements.
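
As a sketch of the Cholesky-based approach, the following solves the linear system of Equation (2.4-4) for a symmetric positive definite second gradient. The function names and the example matrix are hypothetical; this is a textbook form of the factorization, not code taken from the cited references.

```python
import math

def cholesky(H):
    """Lower-triangular L with H = L L^T, for symmetric positive definite H."""
    n = len(H)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = H[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

def cholesky_solve(H, g):
    """Solve H d = g by forward and back substitution on the Cholesky factor."""
    n, L = len(H), cholesky(H)
    y = [0.0] * n
    for i in range(n):                       # forward: L y = g
        y[i] = (g[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    d = [0.0] * n
    for i in reversed(range(n)):             # backward: L^T d = y
        d[i] = (y[i] - sum(L[k][i] * d[k] for k in range(i + 1, n))) / L[i][i]
    return d

H = [[4.0e-4, 2.0e-2], [2.0e-2, 5.0]]        # badly scaled but positive definite
d = cholesky_solve(H, [1.0e-2, 2.0])
```

The failure branch in `cholesky` doubles as a cheap test for positive definiteness, which is useful when deciding whether a start-up procedure is still needed.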

2.4.3 Singularities 

The second gradient matrix used in the Newton-Raphson algorithm is positive definite in a region near a
strict local minimum. Ideally, the start-up procedure will reach such a region, and the Newton-Raphson
algorithm will then converge without needing to contend with singularities. This viewpoint is overly optimistic;
singular or ill-conditioned matrices (the difference is largely academic) arise in many situations. In the
following discussion, we discount the effects of scaling. Matrices that have large condition numbers because
of scaling do not represent intrinsically ill-conditioned problems, and do not require the techniques
discussed in this section.

In some situations, the second gradient matrix is exactly singular for all values of x; two columns (and
rows) are identical or a column (and corresponding row) is zero. These simple singularities occur regularly
even in complex nonlinear problems. They often result from errors in the problem formulation, such as
minimizing with respect to a parameter that is irrelevant to the cost function.

In the more general case, the second gradient is singular (or ill-conditioned) at some points but not at
others. Whenever we use the term singular in the following discussion, we implicitly mean singular or
ill-conditioned. Because of this definition, there will be vaguely defined regions of singularity rather than
isolated points. The consequences of singularities are different depending on whether or not they are near
the minimum.

Singularities far from the minimum pose no basic theoretical difficulties. There are several practical
methods for handling such singularities. One method is to use a gradient algorithm (or any other algorithm
unaffected by such singularities) until x is out of the region of singularity. We can also use this method
if the second gradient matrix has negative eigenvalues, whether the matrix is ill-conditioned or not. If the
matrix has negative eigenvalues, the Newton-Raphson algorithm is likely to behave poorly. (It could even
converge to a local maximum.) The second gradient is always positive semi-definite in a region around a local
minimum, so negative eigenvalues are only a consideration away from the minimum.

Another method of handling singularities is to add a small positive definite matrix to the second gradient
before inversion. We can also use this method to handle negative eigenvalues if the added matrix is large
enough. This method is closely related to the previous suggestion of using a gradient algorithm. If the added
matrix is a large constant times an identity matrix, the Newton-Raphson algorithm, so modified, gives a small
step in the negative gradient direction. For small constants, the algorithm has characteristics between those
of steepest descent and Newton-Raphson. The computational cost of this method is high; in essence, we are
getting performance like steepest descent while paying the computational cost of Newton-Raphson. Even small
additions to the second derivative matrix can dramatically change the convergence behavior of the
Newton-Raphson algorithm. We should therefore discontinue this modification when out of the region of singularity.
The advantage of this method is its simplicity; excluding the test of when the matrix is ill-conditioned, this
modification can be done in two short lines of FORTRAN code.
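
The modification can be sketched in a few lines. The version below assumes a two-dimensional problem and an added matrix of the form c times the identity; the function name, the constant `c`, and the test function with a negative eigenvalue are hypothetical choices for illustration.

```python
def damped_newton_step(grad, hess, x, c):
    """Newton-Raphson step with c*I added to a symmetric 2-by-2 second gradient."""
    g = grad(x)
    (a, b), (_, d) = hess(x)
    a, d = a + c, d + c                      # the modification itself
    det = a * d - b * b
    step = [(d * g[0] - b * g[1]) / det, (a * g[1] - b * g[0]) / det]
    return [x[0] - step[0], x[1] - step[1]]

# Hypothetical cost J = 0.5*x1^2 + 0.25*x2^4 - 0.25*x2^2; its second gradient
# has a negative eigenvalue wherever 3*x2^2 < 0.5.
J = lambda x: 0.5 * x[0] ** 2 + 0.25 * x[1] ** 4 - 0.25 * x[1] ** 2
grad = lambda x: [x[0], x[1] ** 3 - 0.5 * x[1]]
hess = lambda x: [[1.0, 0.0], [0.0, 3.0 * x[1] ** 2 - 0.5]]

start = [1.0, 0.2]
plain = damped_newton_step(grad, hess, start, c=0.0)     # unmodified step
damped = damped_newton_step(grad, hess, start, c=1.0)    # modified step
tiny = damped_newton_step(grad, hess, start, c=1000.0)   # nearly -gradient/c
```

From this starting point the unmodified step moves $x_2$ toward the stationary point at zero, which is not a minimum, while the step with c = 1 moves it toward the minimum near 0.707. With c = 1000 the step is almost exactly the negative gradient divided by c, as the text describes.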

The last method is to use a pseudo-inverse (rank-deficient solution). Penrose (1955), Aoki (1967),
Luenberger (1969), Wilkinson and Reinsch (1971), Moler and Stewart (1973), and Garbow, Boyle, Dongarra, and
Moler (1977) discuss pseudo-inverses in detail. The basic idea of the pseudo-inverse method is to ignore the
directions in the x-space corresponding to zero eigenvalues (within some tolerance) of the second gradient.
In the parameter estimation context, such directions represent parameters, or combinations of parameters, about
which the data give little information. Lacking any information to the contrary, the method leaves such
parameter combinations unchanged from their initial values.

The pseudo-inverse method does not address the problem of negative eigenvalues, but it is popular in a
large class of applications where negative eigenvalues are impossible. The method is easy to implement, being
only a rewrite of the matrix-inversion or linear-system-solution subroutine. It also has a useful property
absent from the other proposed methods; it does not affect the Newton-Raphson algorithm when the matrix is
well-conditioned. Therefore one can freely apply this method without testing whether it is needed. (It is
true that condition tests in some form are part of a pseudo-inverse algorithm, but such tests are at a lower
level contained within the pseudo-inverse subroutine.)

Singularities near the minimum require special consideration. The excellent convergence of
Newton-Raphson near the minimum is the primary reason for using the algorithm. If we significantly slow the
convergence near the minimum, there is little argument for using Newton-Raphson. The use of a pseudo-inverse can
handle singularities while maintaining the excellent convergence; the pseudo-inverse is thus an appropriate
tool for this purpose.
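
A small sketch of a pseudo-inverse solution for a symmetric 2-by-2 second gradient follows. It is an illustration under stated assumptions: the eigenvectors come from a single closed-form rotation, which works only in two dimensions, and the relative tolerance `tol` and the function name are choices made here. The cited references describe general-purpose routines.

```python
import math

def pinv_solve_sym2(H, g, tol=1e-8):
    """Solve H d = g for symmetric 2-by-2 H, ignoring near-zero eigenvalues."""
    a, b, d = H[0][0], H[0][1], H[1][1]
    theta = 0.5 * math.atan2(2.0 * b, a - d)      # rotation that diagonalizes H
    c, s = math.cos(theta), math.sin(theta)
    vecs = [(c, s), (-s, c)]                      # orthonormal eigenvectors
    vals = [a * c * c + 2.0 * b * c * s + d * s * s,
            a * s * s - 2.0 * b * c * s + d * c * c]
    cutoff = tol * max(abs(v) for v in vals)
    step = [0.0, 0.0]
    for lam, v in zip(vals, vecs):
        if abs(lam) > cutoff:                     # skip poorly determined directions
            coef = (v[0] * g[0] + v[1] * g[1]) / lam
            step[0] += coef * v[0]
            step[1] += coef * v[1]
    return step

# Hypothetical cost J = 0.5*(x1 + x2 - 2)^2: only the sum x1 + x2 is determined.
x = [3.0, 0.0]
g = [x[0] + x[1] - 2.0, x[0] + x[1] - 2.0]
step = pinv_solve_sym2([[1.0, 1.0], [1.0, 1.0]], g)
x_new = [x[0] - step[0], x[1] - step[1]]

# A well-conditioned matrix gives the ordinary solution.
step_ok = pinv_solve_sym2([[2.0, 0.0], [0.0, 4.0]], [2.0, 4.0])
```

In the singular case the step changes only the sum of the two parameters; the difference $x_1 - x_2$, about which the cost function says nothing, keeps its starting value of 3.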

Although pseudo-inverses handle the computational problems, singularities near the minimum also raise
theoretical and application issues. Such a singularity indicates that the minimum point is poorly defined.
The cost function is essentially flat in at least one direction from the minimum, and the minimum value of the
cost function might be attained to machine accuracy by widely separated points. Although the algorithm
converges to a minimum point, it might be the wrong minimum point if the minimum is flat. If the only goal is to
minimize the cost function, any minimizing point might be acceptable. In the applications of this book,
minimizing the cost function is only a means to an end; the desired output is the value of x. If multiple
solutions exist, the problem statement is incomplete or faulty.

We strongly advise avoiding the routine use of pseudo-inverses or other computational machinations to
"solve" uniqueness problems. If the basic problem statement is faulty, no numerical trick will solve it. The
pseudo-inverse works by changing the problem statement of the inversion, adding the stipulation that the
inverse have minimum norm. The interpretation of this stipulation is vague in the context of the optimization
problem (unless the cost function is quadratic, in which case it specifies the solution nearest the starting
point). If this stipulation is a reasonable addition to the problem statement, then the pseudo-inverse is an
appropriate tool. This decision can have significant effects. For a nonquadratic cost function, for example,
there might be large differences in the solution point, depending on small changes in the starting point, the
data, or the algorithm.

The pseudo-inverse can be a good diagnostic tool for getting the information needed to revise the problem
statement, but one should not depend upon it to solve the problem autonomously. The analyst's strong point is
in formulating the problem; the computer's strength is in crunching numbers to arrive at the solution. A
failure in either role will compromise the validity of the solution. This statement is but a rephrasing of
the computer cliche "garbage in, garbage out," which has been said many more times than it has been heard.

2.4.4 Quasi-Newton Methods


Quasi-Newton methods are intended for problems where explicit evaluation of the second gradient of the
cost function is complicated or costly, but the performance of the Newton-Raphson algorithm is desired. These
methods form approximations to the second-gradient matrix using the first-gradient values from several
iterations. The approximation to the second gradient then substitutes for the exact second gradient in
Equation (2.4-4). Some of the methods directly form approximations of the inverse of the second-gradient
matrix, avoiding the cost and some of the problems of matrix inversion.

Note that as long as the approximation to the second-gradient matrix is positive definite,
Equation (2.4-4) can never converge to any point with a nonzero first gradient. Therefore approximations to the
second gradient, no matter how poor, cannot affect the solution point. The approximations can, however, greatly
change the speed of convergence and the area of acceptable starting values. Approximations to the first
gradient, by contrast, would affect the solution point as well.

The steepest descent method can be considered as the crudest of the quasi-Newton methods, using a constant
times the identity matrix as the approximation to the second gradient. The performance of the quasi-Newton
methods approaches that of Newton-Raphson as the approximation to the second gradient improves. The
Davidon-Fletcher-Powell method (variable metric method) is the most popular quasi-Newton method. See the
references for discussions of these methods.
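
The two observations above (a positive approximation to the second gradient changes the speed of convergence
but not the solution point, and a constant approximation is the crudest choice) can be illustrated with a
minimal sketch. It assumes that Equation (2.4-4) is the update x <- x - H^{-1} (first gradient), with H the
second gradient or its approximation. The scalar cost function, the starting value, and all names are
hypothetical choices made only for illustration.

```python
# Illustrative scalar cost (an assumption, not from the text):
#   J(x) = (x - 2)^2 + 0.1 (x - 2)^4, with its minimum at x = 2.

def grad(x):
    """First gradient of J."""
    return 2.0 * (x - 2.0) + 0.4 * (x - 2.0) ** 3


def second_grad(x):
    """Exact second gradient of J (always positive for this J)."""
    return 2.0 + 1.2 * (x - 2.0) ** 2


def iterate(x0, hessian_approx, tol=1e-10, max_iter=1000):
    """Repeat x <- x - grad(x) / H(x) until the first gradient is nearly zero.

    hessian_approx(x) returns a positive scalar standing in for the
    second gradient. Returns (solution, iterations used).
    """
    x = x0
    for k in range(max_iter):
        g = grad(x)
        if abs(g) < tol:
            return x, k
        x = x - g / hessian_approx(x)
    return x, max_iter


x_newton, n_newton = iterate(3.0, second_grad)       # exact second gradient
x_crude, n_crude = iterate(3.0, lambda x: 4.0)       # constant approximation
x_cruder, n_cruder = iterate(3.0, lambda x: 10.0)    # poorer constant approximation

print(x_newton, n_newton)
print(x_crude, n_crude)
print(x_cruder, n_cruder)
```

All three runs stop at the same point; only the iteration counts differ.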


2.5 SUMS OF SQUARES 


The algorithms discussed in the previous sections are generally applicable to any minimization problem.
By tailoring algorithms to special characteristics of specific problem classes, we can often achieve far
better performance than by using the general-purpose algorithms.

Many of the cost functions arising in estimation problems have the form of sums of squares. The general
sums-of-squares form is

      J(x) = \sum_{i=1}^{N} [f_i(x)]^* W_i [f_i(x)]                                      (2.5-1)

The f_i are vector-valued functions of x, and the W_i are weightings. To simplify some of the formulae,
we assume that the W_i are symmetric. This assumption does not really restrict the application because we can
always substitute (1/2)(W_i + W_i*) for a nonsymmetric W_i without changing the function values. In most
applications, the W_i are positive semi-definite; this is not a requirement, but we will see that it helps ensure
that the stationary points encountered are local minima. The form of Equation (2.5-1) is common enough to
merit special study.

The summation sign in Equation (2.5-1) is somewhat superfluous in that any function in the form of
Equation (2.5-1) can be rewritten in an equivalent form without the summation sign. This can be done by
concatenating the N different f_i(x) vectors into a single, longer f(x) vector and making a corresponding large
W matrix with the W_i matrices on diagonal blocks. The only difference is in the notation. We choose the
longer notation with the summation sign because it more directly corresponds with the way many parameter
estimation problems are naturally phrased.
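
As a small numerical check of this equivalence, the following sketch evaluates the cost both ways: as a sum over
the individual f_i and W_i, and as a single quadratic form in the concatenated vector with a block-diagonal
weighting. The numerical values and the helper names quad_form and block_diag are hypothetical.

```python
def quad_form(f, W):
    """Return f* W f for a vector f (list) and a square matrix W (list of rows)."""
    n = len(f)
    return sum(f[r] * W[r][c] * f[c] for r in range(n) for c in range(n))


def block_diag(blocks):
    """Place square matrices on the diagonal blocks of one larger matrix."""
    size = sum(len(b) for b in blocks)
    big = [[0.0] * size for _ in range(size)]
    offset = 0
    for b in blocks:
        for r, row in enumerate(b):
            for c, val in enumerate(row):
                big[offset + r][offset + c] = val
        offset += len(b)
    return big


# Hypothetical values of f_1(x) and f_2(x) at some x, with symmetric weightings.
f_list = [[1.0, -2.0], [0.5, 3.0, -1.0]]
W_list = [
    [[2.0, 0.5], [0.5, 1.0]],
    [[1.0, 0.0, 0.2], [0.0, 3.0, 0.0], [0.2, 0.0, 0.5]],
]

# Form with the summation sign, as in Equation (2.5-1).
J_sum = sum(quad_form(f, W) for f, W in zip(f_list, W_list))

# Equivalent form: one long f vector and one block-diagonal W matrix.
f_long = [v for f in f_list for v in f]
W_big = block_diag(W_list)
J_single = quad_form(f_long, W_big)

print(J_sum, J_single)
```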

Several of the algorithms discussed in the previous two sections work well with the form of
Equation (2.5-1). For any reasonable f_i functions, Equation (2.5-1) defines a cost function that is
well-approximated by quadratics over fairly large regions. Since many of the general minimization schemes are
based on quadratic approximations, application of these schemes to Equation (2.5-1) is natural. This statement
does not imply that there are never problems minimizing Equation (2.5-1); the problems are sometimes severe,
but the odds of success with reasonable effort are much better than they are for arbitrary cost function forms.
Although the general methods are usable, we can exploit the problem structure to do better.


2.5.1 Linear Case

If the functions f_i in Equation (2.5-1) are linear, then the cost function is exactly quadratic and we
can express the minimum point in closed form. In particular, let the f_i be the arbitrary linear functions

      f_i(x) = A_i x + b_i                                                               (2.5-2)

Equation (2.5-1) then becomes

      J(x) = \sum_{i=1}^{N} [A_i x + b_i]^* W_i [A_i x + b_i]                            (2.5-3)

Equating the gradient of Equation (2.5-3) to zero gives

      2 \sum_{i=1}^{N} [A_i x + b_i]^* W_i A_i = 0                                       (2.5-4)

Solving for x gives

      x = -\left[ \sum_{i=1}^{N} A_i^* W_i A_i \right]^{-1} \sum_{i=1}^{N} A_i^* W_i b_i (2.5-5)

assuming that the inverse exists.

If the inverse exists, then Equation (2.5-5) gives the only stationary point of Equation (2.5-3). This
stationary point must be a minimum if all the W_i are positive semi-definite, and it must be a maximum if all
the W_i are negative semi-definite. (We leave the straightforward proofs as an exercise.) If the W_i meet
neither of these conditions, the stationary point can be a minimum, a maximum, or a saddle point.

If the inverse in Equation (2.5-5) does not exist, then there is a line (at least) of solutions to
Equation (2.5-4). All of these points are stationary points of the cost function. Use of a pseudo-inverse will
produce the solution with minimum norm, but this is usually a poor idea (see Section 2.4.3).

2.5.2 Nonlinear Case

If the f_i are nonlinear, there is no simple, closed-form solution like Equation (2.5-5). A natural
question in such situations, in which there is an easy method to handle linear equations, is whether we can
merely linearize the nonlinear equations and use the linear methodology. Such linearization does not give an
acceptable closed-form solution to the current problem, but it does form the basis for an iterative method.
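
The linear methodology referred to here is the closed-form solution of Equation (2.5-5). As a minimal sketch,
assuming scalar f_i(x) = A_i x + b_i with a 2-vector x and scalar weightings, it can be coded as follows. The
data (points lying exactly on a straight line) and the names solve_2x2 and linear_minimum are hypothetical and
serve only to make the example self-contained.

```python
def solve_2x2(M, v):
    """Solve M x = v for a 2x2 matrix M; raises if M is singular."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    if det == 0.0:
        raise ValueError("inverse does not exist; solution is not unique")
    return [
        (M[1][1] * v[0] - M[0][1] * v[1]) / det,
        (M[0][0] * v[1] - M[1][0] * v[0]) / det,
    ]


def linear_minimum(A_rows, b, w):
    """Equation (2.5-5) for scalar f_i(x) = A_i x + b_i and scalar weights w_i.

    A_rows[i] is the 1x2 row A_i; returns the 2-vector x.
    """
    S = [[0.0, 0.0], [0.0, 0.0]]   # sum of A_i* W_i A_i
    s = [0.0, 0.0]                 # sum of A_i* W_i b_i
    for A, bi, wi in zip(A_rows, b, w):
        for r in range(2):
            s[r] += A[r] * wi * bi
            for c in range(2):
                S[r][c] += A[r] * wi * A[c]
    x = solve_2x2(S, s)
    return [-x[0], -x[1]]


# Hypothetical data lying exactly on y = 1 + 2 t, so f_i(x) = x_0 + x_1 t_i - y_i.
t = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
A_rows = [[1.0, ti] for ti in t]
b = [-yi for yi in y]
w = [1.0, 2.0, 1.0, 0.5]

x_hat = linear_minimum(A_rows, b, w)
print(x_hat)
```

The explicit singularity check mirrors the caveat "assuming that the inverse exists"; the sketch raises an
error rather than silently substituting a pseudo-inverse.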


Define the linearization of f_i about any point x_j as

      f_i^{(j)}(x) = A_i^{(j)} x + b_i^{(j)}                                             (2.5-6)

where

      A_i^{(j)} = \nabla_x f_i(x_j)                                                      (2.5-7a)

      b_i^{(j)} = f_i(x_j) - A_i^{(j)} x_j                                               (2.5-7b)

Equation (2.5-5), with the A_i^{(j)} and b_i^{(j)} substituted for A_i and b_i, gives the stationary point of
the cost with the linearized f_i functions. This point is not, in general, a solution to the nonlinear problem.
If, however, x_j is close to the solution, then Equation (2.5-5) should give a point closer to the solution,
because the linearization will give a good representation of the cost function in the region around x_j.

The iterative algorithm resulting from this concept is as follows: First, choose a starting value x_0.
The closer x_0 is to the correct solution, the better the algorithm is likely to work. Then define revised
x_j values by

      x_{j+1} = x_j - \left[ \sum_{i=1}^{N} [\nabla_x f_i(x_j)]^* W_i [\nabla_x f_i(x_j)] \right]^{-1}
                      \sum_{i=1}^{N} [\nabla_x f_i(x_j)]^* W_i f_i(x_j)                  (2.5-8)


This equation comes from substituting Equation (2.5-7) into Equation (2.5-5) and simplifying. Iterate
Equation (2.5-8) until it converges by some criterion, or until you give up. This method is often called
quasi-linearization because it is based on linearization not of the cost function itself, but of factors in the
cost function.
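
The iteration of Equation (2.5-8) can be sketched for the simplest possible case: a scalar unknown x, scalar
functions f_i(x) = exp(-x t_i) - y_i, and scalar weightings. The model, the noise-free data, the starting value,
and the stopping rule are all hypothetical choices for illustration; they are not taken from the text.

```python
import math


def gauss_newton(t, y, w, x0, tol=1e-12, max_iter=50):
    """Iterate Equation (2.5-8) for scalar x with f_i(x) = exp(-x t_i) - y_i.

    w[i] plays the role of the (scalar) weighting W_i.
    Returns (estimate, iterations used).
    """
    x = x0
    for j in range(max_iter):
        num = 0.0   # sum of grad f_i * W_i * f_i
        den = 0.0   # sum of grad f_i * W_i * grad f_i
        for ti, yi, wi in zip(t, y, w):
            model = math.exp(-x * ti)
            f = model - yi
            df = -ti * model          # gradient of f_i with respect to x
            num += df * wi * f
            den += df * wi * df
        step = num / den
        x -= step
        if abs(step) < tol:           # "converges by some criterion"
            return x, j + 1
    return x, max_iter                # "or until you give up"


# Hypothetical noise-free data generated with x = 0.5.
t = [1.0, 2.0, 3.0]
y = [math.exp(-0.5 * ti) for ti in t]
w = [1.0, 2.0, 1.0]

x_hat, n_iter = gauss_newton(t, y, w, x0=1.0)
print(x_hat, n_iter)
```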

We made several vague, unsupported statements in the process of deriving this algorithm. We now need to
analyze the algorithm's performance and compare it with the performance of the algorithms discussed in the
previous sections. This task is greatly simplified by noting that Equation (2.5-8) defines a quasi-Newton
algorithm. To show this, we can write the first and second gradients of Equation (2.5-1):

      \nabla_x J(x) = 2 \sum_{i=1}^{N} [f_i(x)]^* W_i [\nabla_x f_i(x)]                  (2.5-9)

      \nabla_x^2 J(x) = 2 \sum_{i=1}^{N} [\nabla_x f_i(x)]^* W_i [\nabla_x f_i(x)]
                        + 2 \sum_{i=1}^{N} [f_i(x)]^* W_i [\nabla_x^2 f_i(x)]            (2.5-10)

(We have not previously introduced the definition of the second gradient of a vector, as in the ∇_x^2 f_i(x)
above. The result is technically a tensor, but we will not need to consider it in detail here.) Comparing
Equation (2.5-8) with Equations (2.4-4), (2.5-9), and (2.5-10), we see that the only difference between
quasi-linearization and Newton-Raphson is that quasi-linearization has dropped the second term in
Equation (2.5-10). Quasi-linearization is thus a quasi-Newton method using

      \nabla_x^2 J(x) \approx 2 \sum_{i=1}^{N} [\nabla_x f_i(x)]^* W_i [\nabla_x f_i(x)] (2.5-11)

as an approximation for the second gradient. The algorithm in this form is also known as Gauss-Newton, the 
term we will adopt in this book. 

Near the solution, the neglected term of the second gradient is generally small. Section 5.4.3 outlines 
this argument as it applies to the parameter estimation problem. Therefore, Gauss-Newton approaches the excel- 
lent performance of Newton-Raphson near the solution. Such approximation is the main goal of quasi-Newton 
methods. 

Accurately approximating the performance of Newton-Raphson far from the minimum is not of great concern
because Newton-Raphson does not generally perform well in regions far from the minimum. We can even argue that
Gauss-Newton sometimes performs better than Newton-Raphson far from the minimum. The worst problems with
Newton-Raphson occur when the second gradient matrix has negative eigenvalues; Newton-Raphson can then go in
the wrong direction, possibly converging to a local maximum or diverging. If all of the W_i are positive
semi-definite (which is usually the case), then the second gradient approximation given by Equation (2.5-11)
is positive semi-definite for all x. A positive semi-definite second gradient approximation does not
guarantee good behavior, but it surely helps; negative eigenvalues virtually guarantee problems. Thus we can
heuristically argue that Gauss-Newton should perform better than Newton-Raphson. We will not attempt a detailed
support of this general argument in this book. In several specific cases the improvement of Gauss-Newton over
Newton-Raphson is easily demonstrable.
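
One such case can be sketched with a hypothetical scalar cost, J(x) = f(x)^2 with f(x) = x^2 - 1, which has
minima at x = 1 and x = -1 and a local maximum at x = 0. The cost function and the starting value 0.3 are
assumptions chosen so that the exact second gradient is negative at the start; they are not from the text.

```python
def newton_raphson(x, n_iter=20):
    """Newton-Raphson on J(x) = (x^2 - 1)^2 using the exact second gradient."""
    for _ in range(n_iter):
        g = 4.0 * x * (x * x - 1.0)      # first gradient of J
        h = 12.0 * x * x - 4.0           # second gradient, negative for |x| < 0.577
        x -= g / h
    return x


def gauss_newton(x, n_iter=20):
    """Gauss-Newton on the same J, written as f(x)^2 with f(x) = x^2 - 1."""
    for _ in range(n_iter):
        f = x * x - 1.0
        df = 2.0 * x                     # first gradient of f
        g = 2.0 * f * df                 # first gradient of J, same as above
        h = 2.0 * df * df                # Equation (2.5-11): never negative
        x -= g / h
    return x


x_nr = newton_raphson(0.3)   # settles at the local maximum, x = 0
x_gn = gauss_newton(0.3)     # settles at the minimum, x = 1

print(x_nr, x_gn)
```

From the same starting value, Newton-Raphson is drawn to the local maximum because the second gradient is
negative there, while the Gauss-Newton approximation keeps the step pointed downhill.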

Although Gauss-Newton sometimes performs better than Newton-Raphson far from the solution, it has many of
the same basic start-up problems. Both algorithms exhibit their best performance near the minimum. Therefore, 
we will often need to begin with some other, more stable algorithm, changing to Gauss-Newton as we near the 
minimum. 

The real argument in favor of Gauss-Newton over Newton-Raphson is the lower computational effort and
complexity of Gauss-Newton. Any performance improvement is a coincidental side benefit. Equation (2.5-11)
involves only first derivatives of f_i(x). These first derivatives are also used in Equation (2.5-9) for the
first gradient of the cost. Therefore, after computing the first gradient of J, the only significant
computation remaining for the Gauss-Newton approximation is the matrix multiplication in Equation (2.5-11). The
computation of the Gauss-Newton approximation for the second gradient can sometimes take less time than the
computation of the first gradient, depending on the system dimensions. For complicated f_i functions, evaluation
of the ∇_x^2 f_i(x) in Equation (2.5-10) is a major portion of the computation effort of the full Newton-Raphson
algorithm. Gauss-Newton avoids this extra effort, obtaining the performance per iteration of Newton-Raphson
(if not better in some areas) with computational effort per iteration comparable to gradient methods.

Considering the cost of the one-dimensional searches required by gradient methods, Gauss-Newton can even
be cheaper per iteration than gradient methods. The exact trade-off depends on the relative costs of
evaluating the f_i and their gradients, and on the typical number of evaluations required in the one-dimensional
searches. Gauss-Newton is at its best when the cost of evaluating the f_i is nearly as much as the cost of
evaluating both the f_i and their gradients due to high overhead costs common to both evaluations. This is
exactly the case in some aircraft applications, where the overhead consists largely of dimensionalizing the
derivatives and building new system matrices at each time point.

The other quasi-Newton methods, such as Davidon-Fletcher-Powell, also approach Newton-Raphson performance
without evaluating the second derivatives of the f_i. These methods, however, do require one-dimensional
searches. Gauss-Newton stands almost alone in avoiding both second derivative evaluations and one-dimensional
searches. This performance is difficult to match in general algorithms that do not take advantage of the
special structure of the cost function.

Some analysts (Foster, 1983) introduce one-dimensional line searches into the Gauss-Newton algorithm to
improve its performance. The utility of this idea depends on how well the Gauss-Newton method is performing.
In most of our experience, Gauss-Newton works well enough that the one-dimensional line searches cannot
measurably improve performance; the total computation time can well be larger with the line searches. When the
Gauss-Newton algorithm is performing poorly, however, such line searches could help stabilize it.

For cost functions in the form of Equation (2.5-1), the cost/performance ratio of Gauss-Newton is so much 
better than that of most other algorithms that Gauss-Newton is the clearly preferred algorithm. You may want 
to modify Gauss-Newton for specific problems, and you will almost surely need to use some special start-up
algorithm, but the best methods will be based on Gauss-Newton. 


2.6 CONVERGENCE IMPROVEMENT 

Second-order methods tend to converge quite rapidly in regions where they work well. There is usually
such a region around the minimum point; the size of the region is problem-dependent. The price paid for this
region of excellent convergence is that the second-order methods often converge poorly or diverge in regions
far from the minimum. Techniques to detect and remedy such convergence problems are an important part of the
practical implementation of second-order methods. In this section, we briefly list a few of the many
convergence improvement techniques.

Modifications to improve the behavior of second-order methods in regions far from the minimum almost
inevitably slow the convergence in the region near the minimum. This reflects a natural trade-off between
speed and reliability of convergence. Therefore, effective implementation of convergence-improvement
techniques usually includes different treatment of regions far from the minimum and near the minimum.

In regions far from the minimum, the second-order methods are modified or abandoned in favor of more
conservative algorithms. In regions near the minimum, there is a transition to the fast second-order methods.

The means of determining when to make such transitions vary widely. Transitions can be based on a simple 
iteration count, on adaptive criteria which examine the observed convergence behavior, or on other principles. 
Transitions can be either gradual or step changes. 

Some convergence improvement techniques abandon second-order methods in the regions far from the minimum, 
adopting gradient methods instead. In our experience, the pure gradient method is too slow for practical use 
on most parameter estimation problems. Accelerated gradient methods such as PARTAN and conjugate gradient are 
reasonable possibilities. 

Other convergence improvement techniques are modifications of the second-order methods. Many convergence 
problems relate to ill-conditioned or nonpositive second gradient matrices. This suggests such modifications 
as adding positive definite matrices to the second gradient or using rank-deficient solutions. 
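
The first of these modifications can be sketched in the scalar case, where the positive definite matrix
reduces to a positive constant added to the second gradient. The cost J(x) = (x^2 - 1)^2, the starting value,
and the constant 8 are hypothetical choices; the unmodified second gradient is negative at the starting value.

```python
def modified_newton(x, added=8.0, n_iter=60):
    """Newton-Raphson on J(x) = (x^2 - 1)^2 with a constant added to the
    second gradient (a scalar stand-in for adding a positive definite matrix)."""
    for _ in range(n_iter):
        g = 4.0 * x * (x * x - 1.0)          # first gradient
        h = 12.0 * x * x - 4.0 + added       # modified second gradient
        x -= g / h
    return x


x_plain = modified_newton(0.3, added=0.0)    # unmodified: goes to the local maximum at 0
x_mod = modified_newton(0.3, added=8.0)      # modified: goes to the minimum at 1

print(x_plain, x_mod)
```

In this sketch the constant is never removed, so convergence near the minimum is only linear; this is the
trade-off between speed and reliability noted earlier in this section, and a practical scheme would reduce
the added term as the minimum is approached.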

Constraints on the allowable range of estimates or on the change per iteration can also have stabilizing 
effects. A particularly popular constraint is to fix some of the ordinates at constant values, thus reducing 
the dimension of the optimization problem; this is a form of axial iteration, and its effectiveness depends on 
a wise (or lucky) choice of the ordinates to be constrained.

Relaxation methods, which reduce the indicated parameter changes by some fixed percentage, can sometimes 
stabilize oscillating behavior of the algorithm. Line searches in the indicated direction extend this concept 
and should be capable of stabilizing almost any problem, at the cost of additional function evaluations.
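
A minimal sketch of relaxation, assuming the hypothetical scalar cost J(x) = sqrt(1 + x^2) with its minimum at
x = 0: from the starting value 1.5 the unmodified Newton-Raphson step oscillates in sign with growing
magnitude, while the same step reduced by 50 percent settles at the minimum. The cost function, starting value,
and relaxation factor are illustrative assumptions only.

```python
import math


def newton_steps(x, relax, n_iter):
    """Newton-Raphson on J(x) = sqrt(1 + x^2), scaling each indicated
    change by the factor `relax` (relax = 1.0 is the unmodified method)."""
    for _ in range(n_iter):
        g = x / math.sqrt(1.0 + x * x)        # first gradient
        h = (1.0 + x * x) ** -1.5             # second gradient (positive)
        x -= relax * g / h
    return x


x_full = newton_steps(1.5, relax=1.0, n_iter=5)       # oscillates and diverges
x_relaxed = newton_steps(1.5, relax=0.5, n_iter=50)   # settles at the minimum

print(x_full, x_relaxed)
```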

The above list of convergence improvement techniques is far from complete. It also omits mention of 
numerous important implementation details. This list serves only to call attention to the area of convergence 
improvement. See the references for more thorough treatments. 



Figure (2.0-1). Illustration of local and global minima.

Figure (2.2-1). Behavior of axial iteration.

Figure (2.2-2). The pattern direction.

Figure (2.3-1). The gradient direction from a circular isocline.

Figure (2.3-2). The gradient direction near a narrow valley.

Figure (2.3-3). Behavior of the gradient algorithm in a narrow valley.

Figure (2.3-4). Worse behavior of the gradient algorithm.


CHAPTER 3


3.0 BASIC PRINCIPLES FROM PROBABILITY

In this chapter we will review some basic definitions and results from probability theory. We presume 
that the reader has had previous exposure to this material. Our aim here is to review and serve as a refer- 
ence for those concepts that are used extensively in the following chapters. The treatment, therefore, is 
quite abbreviated, and devotes little time to motivating the field of study or philosophizing about the 
results. Proofs of several of the statements are omitted. Some of the other proofs are merely outlined, with 
some of the more tedious steps omitted. Apostol (1969), Ash (1970), and Papoulis (1965) give more detailed 
treatment. 


3.1 PROBABILITY SPACES 

3.1.1 Probability Triple

A probability space is formally defined by three items (Ω, ℬ, P), sometimes called the probability triple.
Ω is called the sample space, and the elements ω of Ω are called outcomes or realizations. ℬ is a set of
sets defined on Ω, closed under countable set operations (union, intersection, and complement). Each set
B ∈ ℬ is called an event. In the current discussion, we will not be concerned with the fine details of the
definition of ℬ. ℬ is referred to as the class of measurable sets and is studied in measure theory (Royden,
1968; Rudin, 1974). P is a scalar-valued function defined on ℬ, and is called the probability function or
probability measure. For each set B in ℬ, the function P(B) defines the probability that ω will be in B.
P must satisfy the following axioms:

1) 0 ≤ P(B) ≤ 1 for all B ∈ ℬ

2) P(Ω) = 1

3) P(∪_i B_i) = Σ_i P(B_i) for all countable sequences of disjoint B_i ∈ ℬ

3.1.2 Conditional Probabilities 


If A and B are two events and P(B) ≠ 0, the conditional probability of A given B is defined as

      P(A|B) = P(A \cap B) / P(B)                                                        (3.1-1)

where A ∩ B is the set intersection of the events A and B.

The events A and B are statistically independent if P(A|B) = P(A). Note that this condition is
symmetric; that is, if P(A|B) = P(A), then P(B|A) = P(B), provided that P(A|B) and P(B|A) are both defined.
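
These definitions can be checked on a small finite probability space. The sketch below assumes a hypothetical
sample space of two fair dice with equally likely outcomes; the events A, B, and C are illustrative choices.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered outcomes of two fair dice, all equally likely.
omega = list(product(range(1, 7), repeat=2))


def prob(event):
    """P(event) for an event given as a set of outcomes."""
    return Fraction(len(event), len(omega))


def cond_prob(a, b):
    """P(A|B) = P(A intersect B) / P(B), Equation (3.1-1); requires P(B) != 0."""
    return prob(a & b) / prob(b)


A = {w for w in omega if w[0] % 2 == 0}      # first die is even
B = {w for w in omega if sum(w) == 7}        # the two dice sum to 7
C = {w for w in omega if sum(w) >= 10}       # the two dice sum to 10 or more

print(cond_prob(A, B), prob(A))   # equal: A and B are independent
print(cond_prob(B, A), prob(B))   # the symmetric statement
print(cond_prob(A, C), prob(A))   # not equal: A and C are not independent
```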


3.2 SCALAR RANDOM VARIABLES 

A scalar real-valued function X(ω) defined on Ω is called a random variable if the set {ω: X(ω) ≤ x} is
in ℬ for all real x.

3.2.1 Distribution and Density Functions 

Every random variable has a distribution function defined as follows:

      F_X(x) = P(\{\omega : X(\omega) \le x\})                                           (3.2-1)

It follows directly from the properties of a probability measure that F_X(x) must be a nondecreasing function
of x, with F_X(-∞) = 0 and F_X(∞) = 1. By the Lebesgue decomposition lemma (Royden, 1968, p. 240; Rudin,
1974, p. 129), any distribution function can always be written as the sum of a differentiable component and a
component which is piecewise constant with a countable number of discontinuities. In many cases, we will be
concerned with variables with differentiable distribution functions. For such random variables, we define a
function, p_X(x), called the probability density function, to be the derivative of the distribution function:

      p_X(x) = \frac{d}{dx} F_X(x)                                                       (3.2-2)

We have also the inverse relationship

      F_X(x) = \int_{-\infty}^{x} p_X(s) \, ds                                           (3.2-3)

A probability density function must be nonnegative, and its integral over the real line must equal 1. For
simplicity of notation, we will shorten p_X(x) to p(x) where the meaning is clear. Where confusion is
possible, we will retain the longer notation.

A probability distribution can be defined completely by giving either the distribution function or the
density function. We will work mainly with density functions, except when they are not defined.



3.2.2 Expectations and Moments




The expected value of a random variable, X, is defined by

      E\{X\} = \int_{-\infty}^{\infty} x \, p_X(x) \, dx                                 (3.2-4)

If X does not have a density function, the precise definition of the expectation is somewhat more technical,
involving a Stieltjes integral; Equation (3.2-4) is adequate for the needs of this document. The expected
value is also called the expectation or the mean. Any (measurable) function of a random variable is also a
random variable and

      E\{f(X)\} = \int_{-\infty}^{\infty} f(x) \, p_X(x) \, dx                           (3.2-5)

The expected value of X^n for positive n is called the nth moment of X. Under mild conditions, knowledge
of all of the moments of a distribution is sufficient to define the distribution (Papoulis, 1965, p. 158).

The variance of X is defined as

      var(X) = E\{(X - E\{X\})^2\}

             = E\{X^2\} + E\{X\}^2 - 2 E\{X\} E\{X\}

             = E\{X^2\} - E\{X\}^2                                                       (3.2-6)

The standard deviation is the square root of the variance. 


3.3 JOINT RANDOM VARIABLES

Two random variables defined on the same sample space are called joint random variables. 

3.3.1 Distribution and Density Functions 

If two random variables, X and Y, are defined on the same sample space, we define a joint distribution
function of these variables as

      F_{X,Y}(x,y) = P(\{\omega : X(\omega) \le x, \; Y(\omega) \le y\})                 (3.3-1)

For absolutely continuous distribution functions, a joint probability density function p_{X,Y}(x,y) is defined
by the partial derivative

      p_{X,Y}(x,y) = \frac{\partial^2}{\partial x \, \partial y} F_{X,Y}(x,y)            (3.3-2)

We then have also

      F_{X,Y}(x,y) = \int_{-\infty}^{x} \int_{-\infty}^{y} p_{X,Y}(s,t) \, dt \, ds      (3.3-3)

In a similar manner, joint distributions and densities of N random variables can be defined. As in the
scalar case, the joint density function of N random variables must be nonnegative and its integral over the
entire space must equal 1.

A random N-vector is the same as N jointly random scalar variables, the only difference being in the 
terminology. 

3.3.2 Expectations and Moments 

The expected value of a random vector X is defined as in the scalar case:

      E\{X\} = \int_{-\infty}^{\infty} x \, p_X(x) \, dx                                 (3.3-4)

The covariance of X is a matrix defined by

      cov(X) = E\{[X - E\{X\}][X - E\{X\}]^*\}

             = E\{XX^*\} - E\{X\} E\{X\}^*                                               (3.3-5)

The covariance matrix is always symmetric and positive semi-definite. It is positive definite if X has a
density function. Higher order moments of random vectors can be defined, but are notationally clumsy and
seldom used.

Consider a random vector Y given by

      Y = AX + b                                                                         (3.3-6)

where A is any deterministic matrix (not necessarily square), and b is an appropriate length deterministic
vector. Then the mean and covariance of Y are

      E\{Y\} = E\{AX + b\} = A E\{X\} + b                                                (3.3-7)

      cov(Y) = E\{[Y - E\{Y\}][Y - E\{Y\}]^*\}

             = E\{[AX + b - A E\{X\} - b][AX + b - A E\{X\} - b]^*\}

             = A E\{[X - E\{X\}][X - E\{X\}]^*\} A^*

             = A \, cov(X) \, A^*                                                        (3.3-8)
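
Equations (3.3-7) and (3.3-8) can be verified exactly on a small discrete distribution. The sketch below
assumes a hypothetical X taking four equally likely values, with a nonsquare A and a vector b chosen
arbitrarily; exact rational arithmetic avoids any roundoff in the comparison. (A discrete X has no density
function, but the two equations do not require one.)

```python
from fractions import Fraction as F


def mean(points):
    """Mean vector of equally likely points (a list of vectors)."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def cov(points):
    """Covariance matrix E{[X - E{X}][X - E{X}]*} of equally likely points."""
    m = mean(points)
    d = len(m)
    n = len(points)
    return [[sum((p[r] - m[r]) * (p[c] - m[c]) for p in points) / n
             for c in range(d)] for r in range(d)]


def matmul(P, Q):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(P[r][k] * Q[k][c] for k in range(len(Q)))
             for c in range(len(Q[0]))] for r in range(len(P))]


def transpose(P):
    return [list(col) for col in zip(*P)]


# Hypothetical X (four equally likely 2-vectors); A is 3x2 and b is a 3-vector.
X = [[F(0), F(0)], [F(2), F(1)], [F(1), F(3)], [F(-1), F(2)]]
A = [[F(1), F(2)], [F(0), F(1)], [F(3), F(-1)]]
b = [F(5), F(-1), F(2)]

# Realizations of Y = AX + b.
Y = [[sum(A[r][k] * x[k] for k in range(2)) + b[r] for r in range(3)] for x in X]

mean_rule = [sum(A[r][k] * mean(X)[k] for k in range(2)) + b[r] for r in range(3)]
cov_rule = matmul(matmul(A, cov(X)), transpose(A))

print(mean(Y) == mean_rule)   # Equation (3.3-7)
print(cov(Y) == cov_rule)     # Equation (3.3-8)
```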

3.3.3 Marginal and Conditional Distributions

If X and Y are jointly random variables with a joint distribution function given by Equation (3.3-1),
then X and Y are also individually random variables, with distribution functions defined as in
Equation (3.2-1). The individual distributions of X and Y are called the marginal distributions, and the
corresponding density functions are called marginal density functions.

The marginal distributions of X and Y can be derived from the joint distribution. (Note that the
converse is false without additional assumptions.) By comparing Equations (3.2-1) and (3.3-1), we obtain

      F_X(x) = F_{X,Y}(x, \infty)                                                        (3.3-9a)

and correspondingly

      F_Y(y) = F_{X,Y}(\infty, y)                                                        (3.3-9b)

In terms of the density functions, using Equations (3.2-2) and (3.3-3), we obtain

      p_X(x) = \int_{-\infty}^{\infty} p_{X,Y}(x,y) \, dy                                (3.3-10a)

      p_Y(y) = \int_{-\infty}^{\infty} p_{X,Y}(x,y) \, dx                                (3.3-10b)


The conditional distribution function of X given Y is defined as (see Equation (3.1-1))

      F_{X|Y}(x|y) = P(\{\omega : X(\omega) \le x\} \mid \{\omega : Y(\omega) \le y\})   (3.3-11)

and correspondingly for F_{Y|X}. The conditional density function, when it exists, can be expressed as

      p_{X|Y}(x|y) = p_{X,Y}(x,y) / p_Y(y)                                               (3.3-12)

Equation (3.3-12) is known as Bayes' rule.

The conditional expectation is defined as

      E\{X|Y\} = \int_{-\infty}^{\infty} x \, p_{X|Y}(x|y) \, dx                         (3.3-13)

assuming that the density function exists. Using Equation (3.3-13), we obtain the useful decomposition

      E\{f(X,Y)\} = E\{E\{f(X,Y)|Y\}\}                                                   (3.3-14)

3.3.4 Statistical Independence 

Two random vectors X and Y defined on the same probability space are defined to be independent if

      F_{X,Y}(x,y) = F_X(x) F_Y(y)                                                       (3.3-15)

If the joint probability density function exists, we can write this condition as

      p_{X,Y}(x,y) = p_X(x) p_Y(y)                                                       (3.3-16)

An immediate corollary, using Equation (3.3-12), is that p_{X|Y} does not depend on y, and p_{Y|X} does not
depend on x. If X and Y are independent, then f(X) and g(Y) are independent for any functions f and g.

Two vectors are uncorrelated if

      E\{XY^*\} = E\{X\} E\{Y^*\}                                                        (3.3-17)

or equivalently if

      E\{(X - E\{X\})(Y - E\{Y\})^*\} = 0                                                (3.3-18)

If X and Y are uncorrelated, then the covariance of their sum equals the sum of their covariances:

      cov(X + Y) = cov(X) + cov(Y)                                                       (3.3-19)

If two vectors are independent, then they are uncorrelated, but the converse of this statement is false.
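
A standard counterexample for the converse, sketched here with hypothetical scalar variables: X takes the
values -1, 0, and 1 with equal probability and Y = X^2. The pair is uncorrelated, yet Y is completely
determined by X.

```python
from fractions import Fraction

# X is -1, 0, or 1 with equal probability, and Y = X^2 (so Y depends on X).
outcomes = [(-1, 1), (0, 0), (1, 1)]          # (x, y) pairs
p = Fraction(1, 3)                            # probability of each pair

E_X = sum(p * x for x, _ in outcomes)
E_Y = sum(p * y for _, y in outcomes)
E_XY = sum(p * x * y for x, y in outcomes)

# Uncorrelated: E{XY} equals E{X}E{Y}, the scalar case of Equation (3.3-17).
uncorrelated = (E_XY == E_X * E_Y)

# Not independent: the joint probability of (X = 0, Y = 0) is not the product
# of the marginal probabilities, so the factorization of Equation (3.3-15) fails.
P_x0_y0 = sum(p for x, y in outcomes if x == 0 and y == 0)
P_x0 = sum(p for x, _ in outcomes if x == 0)
P_y0 = sum(p for _, y in outcomes if y == 0)
independent = (P_x0_y0 == P_x0 * P_y0)

print(uncorrelated, independent)
```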


3.4 TRANSFORMATION OF VARIABLES 

A large part of probability theory is concerned in some manner with the transformation of variables; i.e.,
characterizing random variables defined as functions of other random variables. We have previously cited
limited results on the means and covariances of some transformed variables (Equations (3.2-5), (3.3-7), and
(3.3-8)). In this section we seek the entire density function. Our consideration is restricted to variables
that have density functions. Let X be a random vector with density function p_X(x) defined on R^n, the
Euclidean space of real n-vectors. Then define Y ∈ R^m by Y = f(X). We seek to derive the density function
of Y. There are three cases to consider, depending on whether m = n, m > n, or m < n.

The primary case of interest is when m = n. Assume that f(·) is invertible and has continuous partial
derivatives. (Technically, this is only required almost everywhere.) Define g(y) = f^{-1}(y). Then

      p_Y(y) = p_X(g(y)) \, |\det(J)|                                                    (3.4-1)

where J is the Jacobian of the transformation g:

      J_{ij} = \frac{\partial g_i(y)}{\partial y_j}

See Rudin (1974, p. 186) and Apostol (1969, p. 394) for the proof.

Example 3.4-1  Let Y = CX, with C square and nonsingular. Then g(y) = C^{-1} y
and J = C^{-1}, giving

      p_Y(y) = p_X(C^{-1} y) \, |\det(C^{-1})|                                           (3.4-2)

as the transformation equation.


If f is not invertible, the distribution of Y is given by a sum of terms similar to Equation (3.4-1).

For the case with m > n, the distribution of Y will be concentrated on, at most, an n-dimensional
hypersurface in R^m, and will not have a density function in R^m.

The simplest nontrivial case of m < n is when Y consists of a subset of the elements of X. In this
case, the density function sought is the density function of the marginal distribution of the pertinent subset
of the elements of X. Marginal distributions were discussed in Section 3.3.3. In general, when m < n,
X can be transformed into a random vector Z ∈ R^n such that Y is a subset of the elements of Z.


Example 3.4-2  Let X ∈ R^2 and Y = X_1 + X_2. Define Z = CX where



Then using Example 3.4-1,

      p_Z(z) = p_X(C^{-1} z) \, |\det(C^{-1})|
             = \frac{1}{2} \, p_X(C^{-1} z)

where



Then Y = Z_1, so the distribution of Y is the marginal distribution of Z_1, which can be computed from
Equation (3.3-10).
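
The procedure of this example can be sketched numerically. The matrix C = [[1, 1], [1, -1]] used below is an
assumption: it is one choice whose first row gives Z_1 = X_1 + X_2 and for which |det(C^{-1})| = 1/2. The
density of X (two independent variables uniform on [0, 1]) and the midpoint-rule integration are likewise
illustrative choices. For this X, the density of the sum is the well-known triangular density on [0, 2].

```python
def p_x(x1, x2):
    """Joint density of X: two independent variables uniform on [0, 1]."""
    return 1.0 if (0.0 <= x1 <= 1.0 and 0.0 <= x2 <= 1.0) else 0.0


def p_z(z1, z2):
    """Equation (3.4-2) for Z = CX with the assumed C = [[1, 1], [1, -1]].

    C^{-1} = [[0.5, 0.5], [0.5, -0.5]] and |det(C^{-1})| = 1/2.
    """
    x1 = 0.5 * (z1 + z2)
    x2 = 0.5 * (z1 - z2)
    return 0.5 * p_x(x1, x2)


def p_y(y, n=2000):
    """Marginal density of Z_1 (= Y), as in Equation (3.3-10a), by a midpoint
    rule over z2 in [-1, 1], which covers the support of Z_2 = X_1 - X_2."""
    h = 2.0 / n
    return sum(p_z(y, -1.0 + (k + 0.5) * h) for k in range(n)) * h


for y in (0.5, 1.0, 1.5):
    print(y, p_y(y))
```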


3.5 GAUSSIAN VARIABLES

Random variables with Gaussian distributions play a major role in this document and in much of probability
theory. We will, therefore, briefly review the definition and some of the salient properties of Gaussian
distributions. These distributions are often called normal distributions in the literature.


3.5.1 Standard Gaussian Distributions


All Gaussian distributions derive from the distribution of a standard Gaussian variable with mean 0 and
covariance 1. The density function of the standard Gaussian distribution is defined to be

      p(x) = (2\pi)^{-1/2} \exp\left(-\frac{1}{2} x^2\right)                             (3.5-1)

The distribution function does not have a simple closed-form expression. We will first show that
Equation (3.5-1) is a valid density function with mean 0 and covariance 1. The most difficult part is showing
that its integral over the real line is 1.

Theorem 3.5-1  Equation (3.5-1) defines a valid probability density function.

Proof  The function is obviously nonnegative. There remains only to show
that its integral over the real line is 1. Taking advantage of the symmetry
about 0, we can reduce this problem to proving that

      \int_0^{\infty} \exp\left(-\frac{1}{2} x^2\right) dx = \sqrt{\pi/2}                (3.5-2)

There is no closed-form expression for this integral over any finite range,
but for the semi-infinite range of Equation (3.5-2) the following "trick"
works. Form the square of the integral:

      \left[ \int_0^{\infty} \exp\left(-\frac{1}{2} x^2\right) dx \right]^2
          = \int_0^{\infty} \int_0^{\infty} \exp\left[-\frac{1}{2}(x^2 + y^2)\right] dx \, dy      (3.5-3)

Then change variables to polar coordinates, substituting r^2 for x^2 + y^2
and r dr dθ for dx dy, to get

      \left[ \int_0^{\infty} \exp\left(-\frac{1}{2} x^2\right) dx \right]^2
          = \int_0^{\pi/2} \int_0^{\infty} r \exp\left(-\frac{1}{2} r^2\right) dr \, d\theta       (3.5-4)

The inner integral in Equation (3.5-4) has a closed-form solution:

      \int_0^{\infty} r \exp\left(-\frac{1}{2} r^2\right) dr
          = -\exp\left(-\frac{1}{2} r^2\right) \Big|_0^{\infty} = 0 - (-1) = 1           (3.5-5)

Thus,

      \left[ \int_0^{\infty} \exp\left(-\frac{1}{2} x^2\right) dx \right]^2
          = \int_0^{\pi/2} d\theta = \frac{\pi}{2}                                       (3.5-6)

Taking the square root gives Equation (3.5-2), completing the proof.

The mean of the distribution is trivially zero by symmetry. To derive the covariance, note that

      E\{1 - X^2\} = \int_{-\infty}^{\infty} (1 - x^2)(2\pi)^{-1/2} \exp\left(-\frac{1}{2} x^2\right) dx
                   = (2\pi)^{-1/2} \, x \exp\left(-\frac{1}{2} x^2\right) \Big|_{-\infty}^{\infty} = 0     (3.5-9)

Thus,

      cov(X) = E\{X^2\} - E\{X\}^2 = 1 - 0 = 1                                           (3.5-10)

This completes our discussion of the scalar standard Gaussian. 
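
These results are easy to confirm numerically. The following sketch integrates the density of Equation (3.5-1)
with a simple midpoint rule; the integration limits of -10 to 10, the step count, and the function names are
assumptions made for illustration.

```python
import math


def density(x):
    """Standard Gaussian density, Equation (3.5-1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def integrate(func, a=-10.0, b=10.0, n=20000):
    """Midpoint-rule integral of func over [a, b]; the tails beyond
    |x| = 10 are negligible for the integrands used here."""
    h = (b - a) / n
    return sum(func(a + (k + 0.5) * h) for k in range(n)) * h


total = integrate(density)                                       # close to 1
mean = integrate(lambda x: x * density(x))                       # close to 0
second_moment = integrate(lambda x: x * x * density(x))          # close to 1
half_line = integrate(lambda x: math.exp(-0.5 * x * x), a=0.0)   # Equation (3.5-2)

print(total, mean, second_moment)
print(half_line, math.sqrt(math.pi / 2.0))
```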

We define a standard multivariate Gaussian vector to be the concatenation of n independent standard
Gaussian variables. The standard multivariate Gaussian density function is therefore the product of n
marginal density functions in the form of Equation (3.5-1).

      p(x) = \prod_{i=1}^{n} (2\pi)^{-1/2} \exp\left(-\frac{1}{2} x_i^2\right)

           = (2\pi)^{-n/2} \exp\left(-\frac{1}{2} x^* x\right)                           (3.5-11)

The mean of this distribution is 0 and the covariance is an identity matrix.


3.5.2 General Gaussian Distributions

We will define the class of all Gaussian distributions by reference to the standard Gaussian distributions
of the previous section. We define a random vector Y to have a Gaussian distribution if Y can be
represented in the form

      Y = AX + m                                                                         (3.5-12)

where X is a standard Gaussian vector, A is a deterministic matrix and m is a deterministic vector. The
A matrix need not be square. Note that any deterministic vector is a special case of a Gaussian vector with
a zero A matrix.

We have defined the class of Gaussian random variables by a set of operations that can produce such
variables. It now remains to determine the forms and properties of these distributions. (This is somewhat
backwards from the most common approach, where the forms of the distributions are first defined and
Equation (3.5-12) is proven as a result. We find that our approach makes it somewhat easier to handle singular
and nonsingular cases consistently without introducing characteristic functions (Papoulis, 1965).)

By Equations (3.3-7) and (3.3-8), the Y defined by Equation (3.5-12) has mean m and covariance AA*.
Our first major result will be to show that a Gaussian distribution is uniquely specified by its mean and
covariance; that is, if two distributions are both Gaussian and have equal means and covariances, then the two
distributions are identical. Note that this does not mean that the A matrices need to be identical; the
reason the result is nontrivial is that an infinite number of different A matrices give the same covariance
AA*.


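Example 3.5-1 Consider three Gaussian vectors

$Y_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} X_1, \quad Y_2 = \begin{bmatrix} 0.707 & -0.707 \\ 0.707 & 0.707 \end{bmatrix} X_2, \quad \text{and} \quad Y_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0.866 & 0.5 \end{bmatrix} X_3$

where $X_1$ and $X_2$ are standard Gaussian 2-vectors and $X_3$ is a standard
Gaussian 3-vector. We have

$\mathrm{cov}(Y_1) = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$

$\mathrm{cov}(Y_2) = \begin{bmatrix} 0.707 & -0.707 \\ 0.707 & 0.707 \end{bmatrix} \begin{bmatrix} 0.707 & 0.707 \\ -0.707 & 0.707 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$

$\mathrm{cov}(Y_3) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0.866 & 0.5 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 0.866 \\ 0 & 0.5 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$

Thus all three $Y_i$ have equal covariance.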
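The covariances in Example 3.5-1 can be checked numerically. The following Python sketch is an illustration added for this purpose, not part of the original development. It assumes the rounded entries 0.707 and 0.866 stand for $1/\sqrt{2}$ and $\sqrt{3}/2$, and the helper names (`matmul`, `covariance_from_factor`) are hypothetical.

```python
# Numerical check of Example 3.5-1: three different A matrices, one covariance.

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def covariance_from_factor(a):
    """Covariance A A* of Y = A X + m when X is a standard Gaussian vector."""
    return matmul(a, transpose(a))

r = 2.0 ** -0.5           # 0.7071..., assumed to be the 0.707 of the example
c = 3.0 ** 0.5 / 2.0      # 0.8660..., assumed to be the 0.866 of the example

A1 = [[1.0, 0.0], [0.0, 1.0]]
A2 = [[r, -r], [r, r]]
A3 = [[1.0, 0.0, 0.0], [0.0, c, 0.5]]

covariances = [covariance_from_factor(a) for a in (A1, A2, A3)]
for cov in covariances:
    print([[round(v, 12) for v in row] for row in cov])
```

Each printed matrix should be the 2-by-2 identity to within rounding, even though the third A matrix is not square.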


The rest of this section is devoted to proving this result in three steps. First, we will consider 
square, nonsingular A matrices. Second, we will consider general square A matrices. Finally, we will 
consider nonsquare A matrices. Each of these steps uses the results of the previous step. 


Theorem 3.5-2 If Y is a Gaussian n-vector defined by Equation (3.5-12)
with a nonsingular A matrix, then the probability density function of Y
exists and is given by

$p(y) = |2\pi\Lambda|^{-1/2} \exp\left[-\tfrac{1}{2}(y - m)^*\Lambda^{-1}(y - m)\right]$ (3.5-13)

where $\Lambda$ is the covariance AA*.

Proof This is a direct application of the transformation of variables formula,
Equation (3.4-1).

$p_Y(y) = p_X[A^{-1}(y - m)]\,|A^{-1}|$

$= (2\pi)^{-n/2} \exp\left\{-\tfrac{1}{2}[A^{-1}(y - m)]^*[A^{-1}(y - m)]\right\}|A|^{-1}$

$= |2\pi AA^*|^{-1/2} \exp\left[-\tfrac{1}{2}(y - m)^*(AA^*)^{-1}(y - m)\right]$

Substituting $\Lambda$ for AA* then gives the desired result.

Note that the density function, Equation (3.5-13), depends only on the mean and covariance, thus proving
the uniqueness result for the case restricted to nonsingular matrices. A particular case of interest is where
m is 0 and A is unitary. (A unitary matrix is a square one with AA* = I.) In this case, Y has a standard
Gaussian distribution.






Theorem 3.5-3 If Y is a Gaussian n-vector defined by Equation (3.5-12)
with any square A matrix, then Y can be represented as

$Y = S\tilde{X} + m$ (3.5-14)

where $\tilde{X}$ is a standard Gaussian n-vector and S is positive semi-definite.
Furthermore, the S in this representation is unique and depends only on
the covariance of Y.

Proof The uniqueness is easy to prove, and we will do it first. The covariance
of the Y given in Equation (3.5-12) is AA*. The covariance of a Y
expressed as in Equation (3.5-14) is SS*. A necessary (but not sufficient)
condition for Equation (3.5-14) to be a valid representation of Y is,
therefore, that SS* equal AA*. It is an elementary result of linear algebra
(Wilkinson, 1965; Dongarra, Moler, Bunch, and Stewart, 1979; and Strang,
1980) that AA* is always positive semi-definite and that there is one and
only one positive semi-definite matrix S satisfying SS* = AA*. S is
called the matrix square root of AA*. This proves the uniqueness.

The existence proof relies on another result from linear algebra: any square
matrix A can be factored as SQ, where S is positive semi-definite and Q
is unitary. For nonsingular A, this factorization is easy: S is the matrix
square root of AA* and Q is $S^{-1}A$. A formal proof for general A matrices
would be too long a diversion into linear algebra for our current purposes,
so we will omit it. This factorization is closely related to, and can be
formally derived from, the well-known QR factorization, where Q is unitary
and R is upper triangular (Wilkinson, 1965; Dongarra, Moler, Bunch, and
Stewart, 1979; and Strang, 1980).

Given the SQ factorization of A, define

$\tilde{X} = QX$ (3.5-15)

By Theorem (3.5-2), $\tilde{X}$ is a standard Gaussian n-vector. Substituting into
Equation (3.5-12) gives Equation (3.5-14), completing the proof.

Because the S in the above theorem depends only on the covariance of Y, it immediately follows that
the distribution of any Gaussian variable generated by a square A matrix is uniquely specified by the mean
and covariance. It remains only to extend this result to rectangular A matrices.
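The SQ factorization used in the existence proof can be illustrated for a nonsingular 2-by-2 matrix. The sketch below is an added illustration under stated assumptions: the matrix A is an arbitrary choice, and the square root uses a closed form valid only for 2-by-2 symmetric positive definite matrices (a consequence of the Cayley-Hamilton theorem), not a general algorithm. The helper names are hypothetical.

```python
import math

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def sqrt_spd_2x2(m):
    """Positive definite square root of a 2x2 symmetric positive definite matrix.

    Closed form S = (M + s I) / t with s = sqrt(det M) and
    t = sqrt(trace M + 2 s); valid for the 2x2 case only.
    """
    s = math.sqrt(m[0][0] * m[1][1] - m[0][1] * m[1][0])
    t = math.sqrt(m[0][0] + m[1][1] + 2.0 * s)
    return [[(m[0][0] + s) / t, m[0][1] / t],
            [m[1][0] / t, (m[1][1] + s) / t]]

def inverse_2x2(m):
    """Inverse of a nonsingular 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

A = [[2.0, 1.0], [0.0, 1.0]]              # arbitrary nonsingular example
S = sqrt_spd_2x2(matmul(A, transpose(A)))  # matrix square root of A A*
Q = matmul(inverse_2x2(S), A)              # Q = S^{-1} A

print("S   =", S)
print("Q   =", Q)
print("QQ* =", matmul(Q, transpose(Q)))    # should be close to the identity
```

Because Q is unitary, QX is again a standard Gaussian vector, which is the step the proof relies on.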

Theorem 3.5-4 The distribution of any Gaussian vector is uniquely defined
by its mean and covariance.

Proof We have already shown the result for Gaussian vectors generated by
square A matrices. We need only show that a Gaussian vector generated by
a rectangular A matrix can be rewritten in terms of a square A matrix.

Let A be n-by-m, and consider the two cases, n > m and n < m. If n > m,
define a standard Gaussian n-vector $\tilde{X}$ by augmenting the X vector with
n - m independent standard Gaussians. Then define an n-by-n matrix $\tilde{A}$ by
augmenting A with n - m columns of zeros. We then have

$Y = \tilde{A}\tilde{X} + m$

as desired.

For the case n < m, define a random m-vector $\tilde{Y}$ by augmenting Y with
m - n zeros. Then

$\tilde{Y} = \tilde{A}X + \tilde{m}$

where $\tilde{m}$ and $\tilde{A}$ are obtained by augmenting zeros to m and A. Use
Theorem (3.5-3) to rewrite $\tilde{Y}$ as

$\tilde{Y} = \tilde{S}\tilde{X} + \tilde{m}$ (3.5-16)

Since the last m - n elements of $\tilde{Y}$ are zero, Equation (3.5-16) must be
in the form

$\begin{bmatrix} Y \\ 0 \end{bmatrix} = \begin{bmatrix} S & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} \tilde{X}_1 \\ \tilde{X}_2 \end{bmatrix} + \begin{bmatrix} m \\ 0 \end{bmatrix}$

where $\tilde{X}_1$ holds the first n elements of $\tilde{X}$. Thus

$Y = S\tilde{X}_1 + m$

which is in the required form.

Theorem (3.5-4) is the central result of this approach to Gaussian variables. It makes the practical
manipulation of Gaussian variables much easier. Once you have demonstrated that some result is Gaussian, you
need only derive the mean and covariance to specify the distribution completely. This is far easier than
manipulating the full density function or distribution function, a process which often requires partial
differential equations. If the covariance matrix is nonsingular, then the density function exists and is given by
Equation (3.5-13). If the covariance is singular, a density function does not exist (unless you extend the
definition of density functions to include components like impulse functions).

Two properties of the Gaussian density function often provide useful computational shortcuts to evaluating
the mean and covariance of nonsingular Gaussians. The first property is that the mean of the density function
occurs at its maximum. The mean is thus the unique solution of

$\nabla_y \ln p(y) = 0$ (3.5-17)

The logarithm in this equation can be removed, but the equation is usually most useful as written. The second
property is that the covariance can be expressed as

$\mathrm{cov}(Y) = -[\nabla_y^2 \ln p(y)]^{-1}$ (3.5-18)

Both of these properties are easy to verify by direct substitution into Equation (3.5-13).

3.5.3 Properties 

In this section we derive several useful properties of Gaussian vectors. Most of these properties relate 
to operations on Gaussian vectors that give Gaussian results. A major reason for the wide use of Gaussian 
distributions is that many basic operations on Gaussian vectors give Gaussian results, which can be character- 
ized completely by the mean and covariance. 

Theorem 3.5-5 If Y is a Gaussian vector with mean m and covariance $\Lambda$,
and if Z is given by

Z = BY + b

then Z is Gaussian with mean Bm + b and covariance $B\Lambda B^*$.

Proof By definition, Y can be expressed as

Y = AX + m

where X is a standard Gaussian. Substituting Y into the expression for
Z gives

Z = B(AX + m) + b = BAX + (Bm + b)

proving that Z is Gaussian. The mean and covariance expressions for linear
operations on any random vector were previously derived in Equations (3.3-7)
and (3.3-8).
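Theorem 3.5-5 lends itself to a quick simulation check. The following sketch is an added illustration, not part of the original text. The matrices A and B, the vectors m and b, the sample size, and the random seed are all arbitrary assumptions; B has a single row so that Z is scalar.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Assumed example: Y = A X + m, then Z = B Y + b with a single-row B.
A = [[1.0, 0.0], [1.0, 2.0]]
m = [1.0, -1.0]
B = [1.0, 1.0]
b = 0.5

# Values predicted by the theorem: Lambda = A A*, E{Z} = B m + b, cov(Z) = B Lambda B*.
Lam = [[sum(A[i][k] * A[j][k] for k in range(2)) for j in range(2)] for i in range(2)]
mean_theory = sum(B[i] * m[i] for i in range(2)) + b
var_theory = sum(B[i] * Lam[i][j] * B[j] for i in range(2) for j in range(2))

# Monte Carlo estimate of the same two quantities.
n = 20000
samples = []
for _ in range(n):
    x = [random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)]   # standard Gaussian vector
    y = [sum(A[i][k] * x[k] for k in range(2)) + m[i] for i in range(2)]
    samples.append(sum(B[i] * y[i] for i in range(2)) + b)

mean_mc = sum(samples) / n
var_mc = sum((z - mean_mc) ** 2 for z in samples) / (n - 1)

print("mean: theory", mean_theory, "simulation", round(mean_mc, 3))
print("var:  theory", var_theory, "simulation", round(var_mc, 3))
```

The simulation confirms only the mean and variance; that Z is Gaussian follows from the substitution in the proof, not from the sample statistics.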

Several of the properties discussed in this section involve the concept of jointly Gaussian variables.
Two or more random vectors are said to be jointly Gaussian if their joint distribution is Gaussian. Note that
two vectors can both be Gaussian and yet not be jointly Gaussian.

Example 3.5-2 Let Y be a Gaussian random variable with mean 0 and
variance 1. Define Z as

$Z = \begin{cases} Y & -1 \le Y \le 1 \\ -Y & \text{elsewhere} \end{cases}$

The random variable Z is Gaussian with mean 0 and variance 1 (apply Equation (3.4-1) to show this),
but Y and Z are not jointly Gaussian.
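One way to see that Y and Z in Example 3.5-2 are not jointly Gaussian is to look at their sum. If they were jointly Gaussian, Theorem 3.5-5 would make Y + Z Gaussian; but Y + Z is exactly zero whenever |Y| > 1 and equals 2Y otherwise, so it has a point mass at zero without being constant. The sketch below, an added illustration with an arbitrary seed and sample size, estimates that point mass by simulation.

```python
import math
import random

random.seed(1)  # fixed seed so the run is repeatable

def z_of(y):
    """The transformation of Example 3.5-2: keep y on [-1, 1], flip its sign elsewhere."""
    return y if -1.0 <= y <= 1.0 else -y

n = 50000
ys = [random.gauss(0.0, 1.0) for _ in range(n)]
zs = [z_of(y) for y in ys]

# Z should still look like a zero-mean, unit-variance variable.
var_z = sum(z * z for z in zs) / n

# Fraction of samples where Y + Z is exactly zero (happens when |Y| > 1).
frac_zero = sum(1 for y, z in zip(ys, zs) if y + z == 0.0) / n

# Exact probability P(|Y| > 1) for a standard Gaussian.
p_exact = 1.0 - math.erf(1.0 / math.sqrt(2.0))

print("sample variance of Z:", round(var_z, 3))
print("P(Y + Z = 0): simulated", round(frac_zero, 3), "exact", round(p_exact, 3))
```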


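Theorem 3.5-6 Let $Y_1$ and $Y_2$ be jointly Gaussian vectors, and let the mean
and covariance of the joint distribution be partitioned as

$m = \begin{bmatrix} m_1 \\ m_2 \end{bmatrix}, \quad \Lambda = \begin{bmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{bmatrix}$

Then the marginal distributions of $Y_1$ and $Y_2$ are Gaussian with

$E\{Y_1\} = m_1, \quad \mathrm{cov}(Y_1) = \Lambda_{11}$

$E\{Y_2\} = m_2, \quad \mathrm{cov}(Y_2) = \Lambda_{22}$

Proof Apply Theorem (3.5-5) with B = [I 0] and B = [0 I].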

The following two theorems relate to independent Gaussian variables:

Theorem 3.5-7 If Y and Z are two independent Gaussian variables, then Y and Z are jointly
Gaussian.

Proof For nonsingular distributions, this proof is easy to do by writing out
the product of the density functions. For a more general proof, we can
proceed as follows: write Y and Z as

$Y = A_1X_1 + m_1$

$Z = A_2X_2 + m_2$

where $X_1$ and $X_2$ are standard Gaussian vectors. We can always construct the
$X_1$ and $X_2$ in these equations to be independent, but the following argument
avoids the necessity to prove that statement. Define two independent standard
Gaussians, $\tilde{X}_1$ and $\tilde{X}_2$, and further define

$\tilde{Y} = A_1\tilde{X}_1 + m_1$

$\tilde{Z} = A_2\tilde{X}_2 + m_2$

Then $\tilde{Y}$ and $\tilde{Z}$ have the same joint distribution as Y and Z. The concatenation
of $\tilde{X}_1$ and $\tilde{X}_2$ is a standard Gaussian vector. Therefore, $\tilde{Y}$ and $\tilde{Z}$ are jointly
Gaussian because they can be expressed as

$\begin{bmatrix} \tilde{Y} \\ \tilde{Z} \end{bmatrix} = \begin{bmatrix} A_1 & 0 \\ 0 & A_2 \end{bmatrix} \begin{bmatrix} \tilde{X}_1 \\ \tilde{X}_2 \end{bmatrix} + \begin{bmatrix} m_1 \\ m_2 \end{bmatrix}$

Since Y and Z have the same joint distribution as $\tilde{Y}$ and $\tilde{Z}$, Y and Z are also
jointly Gaussian.


Theorem 3.5-8 If Y and Z are two uncorrelated jointly Gaussian variables,
then Y and Z are independent and Gaussian.

Proof By Theorem (3.5-3), we can express

$\begin{bmatrix} Y \\ Z \end{bmatrix} = SX + m$

where X is a standard Gaussian vector and S is positive semi-definite.
Partition S as

$S = \begin{bmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{bmatrix}$

By the definition of "uncorrelated," we must have $S_{12} = S_{21}^* = 0$. Therefore,
partitioning X into $X_1$ and $X_2$, and partitioning m into $m_1$ and $m_2$, we
can write

$Y = S_{11}X_1 + m_1$

$Z = S_{22}X_2 + m_2$

Since Y and Z are functions of the independent vectors $X_1$ and $X_2$, Y and Z
are independent and Gaussian.

Since any two independent vectors are uncorrelated, Theorem (3.5-8) proves that independence and lack of
correlation are equivalent for Gaussians.


We previously covered marginal distributions of Gaussian vectors. The following theorem considers
conditional distributions. We will directly consider only conditional distributions of nonsingular Gaussians.
Since the results of the theorem involve inverses, there are obvious difficulties that cannot be circumvented
by avoiding the use of probability density functions in the proof.


Theorem 3.5-9 Let $Y_1$ and $Y_2$ be jointly Gaussian variables with a nonsingular
joint distribution. Partition the mean, covariance, and inverse covariance
of the joint distribution as

$m = \begin{bmatrix} m_1 \\ m_2 \end{bmatrix}, \quad \Lambda = \begin{bmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{bmatrix}, \quad \Gamma = \Lambda^{-1} = \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{21} & \Gamma_{22} \end{bmatrix}$

Then the conditional distributions of $Y_1$ given $Y_2$, and of $Y_2$ given $Y_1$, are
Gaussian with means and covariances

$E\{Y_1|Y_2\} = m_1 + \Lambda_{12}\Lambda_{22}^{-1}(Y_2 - m_2)$ (3.5-18a)

$\mathrm{cov}\{Y_1|Y_2\} = \Lambda_{11} - \Lambda_{12}\Lambda_{22}^{-1}\Lambda_{21} = (\Gamma_{11})^{-1}$ (3.5-18b)

$E\{Y_2|Y_1\} = m_2 + \Lambda_{21}\Lambda_{11}^{-1}(Y_1 - m_1)$ (3.5-19a)

$\mathrm{cov}\{Y_2|Y_1\} = \Lambda_{22} - \Lambda_{21}\Lambda_{11}^{-1}\Lambda_{12} = (\Gamma_{22})^{-1}$ (3.5-19b)

Proof The joint probability density function of $Y_1$ and $Y_2$ is

$p(y_1,y_2) = c \exp\left\{-\tfrac{1}{2} \begin{bmatrix} y_1 - m_1 \\ y_2 - m_2 \end{bmatrix}^* \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{21} & \Gamma_{22} \end{bmatrix} \begin{bmatrix} y_1 - m_1 \\ y_2 - m_2 \end{bmatrix}\right\}$ (3.5-20)

where c is a scalar constant, the magnitude of which we will not need to
compute. Expanding the exponent, and recognizing that $\Gamma_{21} = \Gamma_{12}^*$, gives

$p(y_1,y_2) = c \exp\left[-\tfrac{1}{2}(y_1 - m_1)^*\Gamma_{11}(y_1 - m_1) - \tfrac{1}{2}(y_2 - m_2)^*\Gamma_{22}(y_2 - m_2) - (y_1 - m_1)^*\Gamma_{12}(y_2 - m_2)\right]$

Completing squares results in

$p(y_1,y_2) = c \exp\left\{-\tfrac{1}{2}[y_1 - m_1 + \Gamma_{11}^{-1}\Gamma_{12}(y_2 - m_2)]^*\Gamma_{11}[y_1 - m_1 + \Gamma_{11}^{-1}\Gamma_{12}(y_2 - m_2)] - \tfrac{1}{2}(y_2 - m_2)^*(\Gamma_{22} - \Gamma_{21}\Gamma_{11}^{-1}\Gamma_{12})(y_2 - m_2)\right\}$ (3.5-21)

Integrating this expression with respect to $y_1$ gives the marginal density
function of $Y_2$. The second term in the exponent does not involve $y_1$, and
we recognize the first term as the exponent in a Gaussian density function
with mean $m_1 - \Gamma_{11}^{-1}\Gamma_{12}(y_2 - m_2)$ and covariance $\Gamma_{11}^{-1}$. Its integral with
respect to $y_1$ is therefore a constant independent of $y_2$. The marginal
density function of $Y_2$ is therefore

$p(y_2) = c_2 \exp\left[-\tfrac{1}{2}(y_2 - m_2)^*(\Gamma_{22} - \Gamma_{21}\Gamma_{11}^{-1}\Gamma_{12})(y_2 - m_2)\right]$ (3.5-22)

where $c_2$ is a constant. Note that because we know that Equation (3.5-22)
must be a probability density function, we need not compute the value of $c_2$;
this saves us a lot of work. Equation (3.5-22) is an expression for a
Gaussian density function with mean $m_2$ and covariance $(\Gamma_{22} - \Gamma_{21}\Gamma_{11}^{-1}\Gamma_{12})^{-1}$.
The partitioned matrix inversion lemma (Appendix A) gives us

$(\Gamma_{22} - \Gamma_{21}\Gamma_{11}^{-1}\Gamma_{12})^{-1} = \Lambda_{22}$

thus independently verifying the result of Theorem (3.5-6) on the marginal
distribution.

The conditional density of $Y_1$ given $Y_2$ is obtained using Bayes' rule, by
dividing Equation (3.5-21) by Equation (3.5-22)

$p(y_1|y_2) = \frac{p(y_1,y_2)}{p(y_2)} = c_1 \exp\left\{-\tfrac{1}{2}[y_1 - m_1 + \Gamma_{11}^{-1}\Gamma_{12}(y_2 - m_2)]^*\Gamma_{11}[y_1 - m_1 + \Gamma_{11}^{-1}\Gamma_{12}(y_2 - m_2)]\right\}$ (3.5-23)

where $c_1$ is a constant. This is an expression for a Gaussian density
function with a mean $m_1 - \Gamma_{11}^{-1}\Gamma_{12}(y_2 - m_2)$ and covariance $\Gamma_{11}^{-1}$. The
partitioned matrix inversion lemma (Appendix A) then gives

$\Gamma_{11}^{-1} = \Lambda_{11} - \Lambda_{12}\Lambda_{22}^{-1}\Lambda_{21}$

$\Gamma_{11}^{-1}\Gamma_{12} = -\Lambda_{12}\Lambda_{22}^{-1}$

Thus the conditional distribution of $Y_1$ given $Y_2$ is Gaussian with mean
$m_1 + \Lambda_{12}\Lambda_{22}^{-1}(y_2 - m_2)$ and covariance $\Lambda_{11} - \Lambda_{12}\Lambda_{22}^{-1}\Lambda_{21}$, as we desired to prove.
The conditional distribution of $Y_2$ given $Y_1$ follows by symmetry.

The final result of this section concerns sums of Gaussian variables.

Theorem 3.5-10 If $Y_1$ and $Y_2$ are jointly Gaussian random vectors of equal
length and their joint distribution has mean and covariance partitioned as

$m = \begin{bmatrix} m_1 \\ m_2 \end{bmatrix}, \quad \Lambda = \begin{bmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{bmatrix}$

Then $Y_1 + Y_2$ is Gaussian with mean $m_1 + m_2$ and covariance
$\Lambda_{11} + \Lambda_{22} + \Lambda_{12} + \Lambda_{21}$.

Proof Apply Theorem (3.5-5) with B = [I I] and b = 0.
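The two equivalent forms of the conditional mean and covariance in Theorem 3.5-9, one in terms of the covariance and one in terms of its inverse, can be compared numerically in the simplest case of scalar $Y_1$ and $Y_2$. The sketch below is an added illustration; the mean, covariance, observed value, and function names are arbitrary assumptions.

```python
# Scalar Y1 and Y2 with an assumed joint mean and covariance.
m = [1.0, -1.0]
Lam = [[4.0, 2.0], [2.0, 3.0]]

# Inverse covariance (Gamma) of the 2x2 joint covariance.
det = Lam[0][0] * Lam[1][1] - Lam[0][1] * Lam[1][0]
Gam = [[Lam[1][1] / det, -Lam[0][1] / det],
       [-Lam[1][0] / det, Lam[0][0] / det]]

def conditional_from_covariance(y2):
    """Mean and variance of Y1 given Y2 = y2, written in terms of Lambda."""
    mean = m[0] + Lam[0][1] / Lam[1][1] * (y2 - m[1])
    var = Lam[0][0] - Lam[0][1] * Lam[1][0] / Lam[1][1]
    return mean, var

def conditional_from_inverse(y2):
    """The same quantities written in terms of Gamma = Lambda^{-1}."""
    mean = m[0] - Gam[0][1] / Gam[0][0] * (y2 - m[1])
    var = 1.0 / Gam[0][0]
    return mean, var

y2_observed = 0.5
print(conditional_from_covariance(y2_observed))
print(conditional_from_inverse(y2_observed))
```

Both calls should print the same pair. Note that the conditional variance does not depend on the observed value, only on the joint covariance.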


A simple summary of this section is that linear operations on Gaussian variables give Gaussian results. This
principle is not generally true for nonlinear operations. Therefore, Gaussian distributions are strongly
associated with the analysis of linear systems.

3.5.4 Central Limit Theorem

The Central Limit Theorem is often used as a basis for justifying the assumption that the distribution
of some physical quantity is approximately Gaussian.

Theorem 3.5-11 Let $Y_1, Y_2, \ldots$ be a sequence of independent, identically distributed random
vectors with finite mean m and covariance $\Lambda$. Then the vectors

$Z_N = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} (Y_i - m)$

converge in distribution to a Gaussian vector with mean zero and covariance $\Lambda$.

Proof See Ash (1970, p. 171) and Apostol (1969, p. 567).

Cramer (1946) discusses several variants on this theorem, where the $Y_i$ need not be independent and
identically distributed, but other requirements are placed on the distributions. The general result is that sums
of random variables tend to Gaussian limits under fairly broad conditions. The precise conditions will not
concern us here. An implication of this theorem is that macroscopic behavior which is the result of the
summation of a large number of microscopic events often has a Gaussian distribution. The classic example is
Brownian motion. We will illustrate the Central Limit Theorem with a simple example.

Example 3.5-3 Let the distribution of the $Y_i$ in Theorem (3.5-11) be uniform
on the interval (-1,1). Then the mean is zero and the covariance is 1/3.
Examine the density functions of the first few $Z_N$.

The first function, $Z_1$, is equal to $Y_1$, and thus is uniform on (-1,1).
Figure (3.5-1) compares the densities of $Z_1$ and the Gaussian limit. The
Gaussian limit distribution has mean zero and variance 1/3.

For the second function we have

$Z_2 = \frac{1}{\sqrt{2}}(Y_1 + Y_2)$

and the density function of $Z_2$ is given by

$p(z_2) = \tfrac{1}{2}(\sqrt{2} - |z_2|)$ for $|z_2| \le \sqrt{2}$

Figure (3.5-2) compares the density of $Z_2$ with the Gaussian limit.
The density function of $Z_3$ is given by

$p(z_3) = \begin{cases} \frac{3\sqrt{3}}{8}(1 - z_3^2) & |z_3| \le \frac{1}{\sqrt{3}} \\ \frac{3\sqrt{3}}{16}(z_3^2 - 2\sqrt{3}|z_3| + 3) & \frac{1}{\sqrt{3}} \le |z_3| \le \sqrt{3} \end{cases}$

Figure (3.5-3) compares the density of $Z_3$ with the Gaussian limit. By the time
N is 3, $Z_N$ is already becoming reasonably close to Gaussian.
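Since the comparison in this example is otherwise made through figures, a short numerical sketch may help. It is an added illustration, not part of the original text. The densities of $Z_1$, $Z_2$, and $Z_3$ are coded from the convolution of uniform densities on (-1,1) with the $1/\sqrt{N}$ scaling of Theorem 3.5-11; the midpoint-rule integrator and its step count are arbitrary choices.

```python
import math

SQRT2 = math.sqrt(2.0)
SQRT3 = math.sqrt(3.0)

def p_z1(z):
    """Density of Z_1 = Y_1, uniform on (-1, 1)."""
    return 0.5 if abs(z) <= 1.0 else 0.0

def p_z2(z):
    """Density of Z_2 = (Y_1 + Y_2) / sqrt(2): triangular on (-sqrt(2), sqrt(2))."""
    a = abs(z)
    return 0.5 * (SQRT2 - a) if a <= SQRT2 else 0.0

def p_z3(z):
    """Density of Z_3 = (Y_1 + Y_2 + Y_3) / sqrt(3): piecewise quadratic."""
    a = abs(z)
    if a <= 1.0 / SQRT3:
        return 3.0 * SQRT3 / 8.0 * (1.0 - a * a)
    if a <= SQRT3:
        return 3.0 * SQRT3 / 16.0 * (SQRT3 - a) ** 2
    return 0.0

def p_limit(z):
    """Gaussian limit density with mean zero and variance 1/3."""
    var = 1.0 / 3.0
    return math.exp(-z * z / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def integrate(f, lo, hi, n=20000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))

densities = {"Z1": p_z1, "Z2": p_z2, "Z3": p_z3}
areas = {k: integrate(f, -2.0, 2.0) for k, f in densities.items()}
variances = {k: integrate(lambda z, f=f: z * z * f(z), -2.0, 2.0)
             for k, f in densities.items()}

for name, f in densities.items():
    print(name, "area", round(areas[name], 4), "variance", round(variances[name], 4),
          "p(0)", round(f(0.0), 4), "limit p(0)", round(p_limit(0.0), 4))
```

Every $Z_N$ has unit area and variance 1/3 by construction; what changes with N is the shape of the density, which is what the figures compare.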











CHAPTER 4 







4.0 STATISTICAL ESTIMATORS 

In this chapter, we introduce the concept of an estimator. We then define some basic measures of
estimator performance. We use these measures of performance to introduce several common statistical estimators.

The definitions in this chapter are general. Subsequent chapters will treat specific forms. For other
treatments of this and related material, see Sorenson (1980), Schweppe (1973), Goodwin and Payne (1977), and
Eykhoff (1974). These books also cover other estimators that we do not mention here.

4.1 DEFINITION OF AN ESTIMATOR

The concept of estimation is central to our study. The statistical definition of an estimator is as
follows:

Perform an experiment (input) U, taken from the set (U) of possible experiments on the system. The system
response is a random variable:

$Z = Z(\xi, U, \omega)$ (4.1-1)

where $\xi \in \Xi$ is the true value of the parameter vector and $\omega \in \Omega$ is the random component of the system.

An estimator is any function of Z with range in $\Xi$. The value of the function is called the estimate
$\hat{\xi}$. Thus

$\hat{\xi} = \hat{\xi}(Z,U) = \hat{\xi}(Z(\xi,U,\omega),U)$ (4.1-2)

This definition is readily generalized to multiple performances of the same experiment or to the performance
of more than one experiment. If N experiments $U_i$ are performed, with responses $Z_i$, then an estimate
would be of the form

$\hat{\xi} = \hat{\xi}(Z_1,\ldots,Z_N,U_1,\ldots,U_N) = \hat{\xi}(Z(\xi,U_1,\omega_1),\ldots,Z(\xi,U_N,\omega_N),U_1,\ldots,U_N)$ (4.1-3)

where the $\omega_i$ are independent. The N experiments can be regarded as a single "super-experiment"
$(U_1,\ldots,U_N) \in (U) \times (U) \times \cdots \times (U)$, the response to which is the concatenated vector
$(Z_1,\ldots,Z_N) \in (Z) \times (Z) \times \cdots \times (Z)$.
The random element is $(\omega_1,\ldots,\omega_N) \in \Omega \times \Omega \times \cdots \times \Omega$. Equation (4.1-3) is then simply a restatement of
Equation (4.1-2) on the larger space.

For simplicity of notation, we will generally omit the dependence on U from Equations (4.1-1) and
(4.1-2). For the most part, we will be discussing parameter estimation based on responses to specific, known
inputs; therefore, the dependence of the response and the estimate on the input are irrelevant, and merely
clutter up the notation. Formally, all of the distributions and expectations may be considered to be
implicitly conditioned on U.

Note that the estimate $\hat{\xi}$ is a random variable because it is a function of Z, which is a random
variable. When the experiment is actually performed, specific realizations of these random variables will be
obtained. The true parameter value $\xi$ is not usually considered to be random, simply unknown.

In some situations, however, it is convenient to define $\xi$ as a random variable instead of as an unknown
parameter. The significant difference between these approaches is that a random variable has a probability
distribution, which constitutes additional information that can be used in the random-variable approach.
Several popular estimators can only be defined using the random-variable approach. These advantages of the
random-variable approach are balanced by the necessity to know the probability distribution of $\xi$. If this
distribution is not known, there are no differences, except in terminology, between the random-variable and
unknown-parameter approaches.

A third view of $\xi$ involves ideas from information theory. In this context, $\xi$ is considered to be an
unknown parameter as above. Even though $\xi$ is not random, it is defined to have a "probability distribution."
This probability distribution does not relate to any randomness of $\xi$, but reflects our knowledge or
information about the value of $\xi$. Distributions with low variance correspond to a high degree of certainty about
the value of $\xi$, and vice versa. The term "probability distribution" is a misnomer in this context. The terms
"information distribution" or "information function" more accurately reflect this interpretation.

In the context of information theory, the marginal or prior distribution $p(\xi)$ reflects the information
about $\xi$ prior to performing the experiment. A case where there is no prior information can be handled as a
limit of prior distributions with less and less information (variance going to infinity). The distribution of
the response Z is a function of the value of $\xi$. When $\xi$ is a random variable, this is called $p(Z|\xi)$, the
conditional distribution of Z given $\xi$. We will use the same notation when $\xi$ is not random in order to
emphasize the dependence of the distribution on $\xi$, and for consistency of notation. When $p(\xi)$ is defined,
the joint probability density is then

$p(Z,\xi) = p(Z|\xi)p(\xi)$ (4.1-4)

The marginal probability density of Z is

$p(Z) = \int p(Z,\xi)\,d|\xi|$ (4.1-5)

The conditional density of $\xi$ given Z (also called the posterior density) is

$p(\xi|Z) = \frac{p(Z,\xi)}{p(Z)} = \frac{p(Z|\xi)p(\xi)}{p(Z)}$ (4.1-6)

In the information theory context, the posterior distribution reflects information about the value of $\xi$ after
the experiment is performed. It accounts for the information known prior to the experiment, and the
information gained by the experiment.
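The chain from prior to posterior in Equations (4.1-4) to (4.1-6) can be traced numerically on a grid of parameter values. The sketch below is an added illustration under loudly stated assumptions: a scalar parameter, a Gaussian prior with variance 4, and a hypothetical measurement model Z = (parameter) + noise with unit noise variance. None of these choices comes from the text.

```python
import math

def gaussian_pdf(x, mean, var):
    """Scalar Gaussian density."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Grid of candidate parameter values (assumed range and spacing).
step = 0.01
grid = [-10.0 + step * i for i in range(2001)]

prior_var = 4.0      # assumed prior: zero mean, variance 4
noise_var = 1.0      # assumed model: Z = xi + noise, noise variance 1
z_observed = 1.5     # assumed realization of the response

prior = [gaussian_pdf(xi, 0.0, prior_var) for xi in grid]
likelihood = [gaussian_pdf(z_observed, xi, noise_var) for xi in grid]

joint = [l * p for l, p in zip(likelihood, prior)]   # Equation (4.1-4)
p_z = sum(joint) * step                              # Equation (4.1-5), Riemann sum
posterior = [j / p_z for j in joint]                 # Equation (4.1-6)

post_mean = sum(xi * p for xi, p in zip(grid, posterior)) * step
post_var = sum((xi - post_mean) ** 2 * p for xi, p in zip(grid, posterior)) * step

print("posterior mean", round(post_mean, 4), "posterior variance", round(post_var, 4))
```

For this Gaussian case the posterior variance is smaller than both the prior variance and the noise variance, which is the sense in which the posterior combines the prior information with the information gained by the experiment.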

The distinctions among the random variable, unknown parameter, and information theory points of view are
largely academic. Although the conventional notations differ, the equations used are equivalent in all three
cases. Our presentation uses the probability density notation throughout. We see little benefit in repeating
identical derivations, substituting the term "information function" for "likelihood function" and changing
notation. We derive the basic equations only once, restricting the distinctions among the three points of view
to discussions of applicability and interpretation.


4.2 PROPERTIES OF ESTIMATORS 

We can define an infinite number of estimators for a given problem. The definition of an estimator
provides no means of evaluating these estimators, some of which can be ridiculously poor. This section will
describe some of the properties used to evaluate estimators and to select a good estimator for a particular
problem. The properties are all expressed in terms of optimality criteria.

4.2.1 Unbiased Estimators

A bias is a consistent or repeatable error. The parameter estimates from any specific data set will
always be imperfect. It is reasonable to hope, however, that the estimate obtained from a large set of
maneuvers would be centered around the true value. The errors in the estimates might be thought of as
consisting of two components: consistent errors and random errors. Random errors are generally unavoidable.
Consistent or average errors might be removable.

Let us restate the above ideas more precisely. The bias b of an estimator $\hat{\xi}(\cdot)$ is defined as

$b(\xi) = E\{\hat{\xi}|\xi\} - \xi = E\{\hat{\xi}(Z(\xi,\omega))|\xi\} - \xi$ (4.2-1)

The Z in these equations is a random variable, not a specific realization. Note that the bias is a function
of the true value. It averages out (by the E{.}) the random noise effects, but there is no averaging among
the different true values. The bias is also a function of the input U, but this dependence is not usually
made explicit. All discussions of bias are implicitly referring to some given input.

An unbiased estimator is defined as an estimator for which the bias is identically zero:

$b(\xi) = 0$ (4.2-2)

This requirement is quite stringent because it must be met for every value of $\xi$. Unbiased estimators may not
exist for some problems. For other problems, unbiased estimators may exist, but may be too complicated for
practical computation. Any estimator that is not unbiased is called biased.

Generally, it is considered desirable for an estimator to be unbiased. This judgment, however, does not 
apply to all situations. The bias of an estimator measures only the average of its behavior. It is possible 
for the individual estimates to be so poor that they are ludicrous, yet average out so that the estimator is 
unbiased. The following example is taken from Ferguson (1967, p. 136). 

Example 4.2-1 A telephone operator has been working for 10 minutes and wonders
if he would be missed if he took a 20-minute coffee break. Assume that
calls are coming in as a Poisson process with the average rate of $\lambda$ calls
per 10 minutes, $\lambda$ being unknown. The number Z of calls received in the
first 10 minutes has a Poisson distribution with parameter $\lambda$.

$P(Z|\lambda) = \frac{e^{-\lambda}\lambda^Z}{Z!}, \quad Z = 0, 1, \ldots$

On the basis of Z, the operator desires to estimate $\theta$, the probability of
receiving no calls in the next 20 minutes. For a Poisson process, $\theta = e^{-2\lambda}$.
If the estimator $\hat{\theta}(Z)$ is to be unbiased, we must have

$E\{\hat{\theta}(Z(\theta,\omega))|\theta\} = \theta$ for all $\theta \in (0,1]$

Thus

$\sum_{Z=0}^{\infty} \hat{\theta}(Z) \frac{e^{-\lambda}\lambda^Z}{Z!} = \theta = e^{-2\lambda}$ for all $\lambda \in [0,\infty)$

Multiply by $e^{\lambda}$, giving

$\sum_{Z=0}^{\infty} \hat{\theta}(Z) \frac{\lambda^Z}{Z!} = e^{-\lambda}$

Expand the right-hand side as a power series to get

$\sum_{Z=0}^{\infty} \hat{\theta}(Z) \frac{\lambda^Z}{Z!} = \sum_{Z=0}^{\infty} (-1)^Z \frac{\lambda^Z}{Z!}$

The convergent power series are equal for all $\lambda \in [0,\infty)$ only if the coefficients
are identical. Thus $\hat{\theta}(Z) = (-1)^Z$ is the only unbiased estimator of $\theta$ for
this problem. The operator would estimate the probability of missing no
calls as +1 if he had received an even number of calls and -1 if he had
received an odd number of calls. This estimator is the only unbiased estimator
for the problem, but it is a ridiculously poor one. If the estimates are
required to lie in the meaningful range of [0,1], then there is no unbiased
estimator, but some quite reasonable biased estimators can be easily constructed.
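The claims of this example can be checked by summing the Poisson series directly. The sketch below is an added illustration. The text does not name a particular biased estimator, so the plug-in estimate exp(-2Z), which replaces the unknown rate by the observed count, is an assumption chosen here only for comparison. The rate value and function names are likewise hypothetical.

```python
import math

def poisson_expectation(g, lam, terms=200):
    """E{g(Z)} for Z ~ Poisson(lam), summing the first `terms` terms of the series."""
    total = 0.0
    pmf = math.exp(-lam)          # P(Z = 0)
    for z in range(terms):
        total += g(z) * pmf
        pmf *= lam / (z + 1)      # P(Z = z + 1) obtained from P(Z = z)
    return total

lam = 1.0                         # assumed true rate (calls per 10 minutes)
theta = math.exp(-2.0 * lam)      # probability of no calls in the next 20 minutes

def unbiased(z):
    """The only unbiased estimator from the example: +1 or -1."""
    return (-1) ** z

def plug_in(z):
    """Hypothetical biased alternative (not from the text): exp(-2 Z)."""
    return math.exp(-2.0 * z)

def mean_square_error(estimator):
    return poisson_expectation(lambda z: (estimator(z) - theta) ** 2, lam)

mean_unbiased = poisson_expectation(unbiased, lam)
mean_plug_in = poisson_expectation(plug_in, lam)
mse_unbiased = mean_square_error(unbiased)
mse_plug_in = mean_square_error(plug_in)

print("true value        ", theta)
print("unbiased: mean, mse", mean_unbiased, mse_unbiased)
print("plug-in:  mean, mse", mean_plug_in, mse_plug_in)
```

At this rate the plug-in estimate is biased but has a much smaller mean square error, and unlike the unbiased estimator it always lies in [0,1].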

The bias is a useful tool for studying estimators. In general, it is desirable for the bias to be zero, 
or at least small. However, because the bias measures only the average properties of the estimates, it cannot 
be used as the sole criterion for evaluating estimators. It is possible for a biased estimator to be clearly 
superior to all of the unbiased estimators for a problem. 

4.2.2 Minimum Variance Estimators

The variance of an estimator is defined as

$\mathrm{var}(\hat{\xi}) = E\{(\hat{\xi} - E\{\hat{\xi}|\xi\})(\hat{\xi} - E\{\hat{\xi}|\xi\})^*|\xi\}$ (4.2-3)

Note that the variance, like the bias, is a function of the input and the true value. The variance alone is 
not a reasonable measure for evaluating an estimator. For instance, any constant estimator (one that always 
returns a constant value, ignoring the data) has zero variance. These are obviously poor estimators in most 
situations. 

A more useful measure is the mean square error:

$\mathrm{mse}(\hat{\xi}) = E\{(\hat{\xi} - \xi)(\hat{\xi} - \xi)^*|\xi\}$ (4.2-4)

The mean square error and variance are obviously identical for unbiased estimators ($E\{\hat{\xi}|\xi\} = \xi$). An estimator
is uniformly minimum mean-square error if, for every value of $\xi$, its mean square error is less than or equal
to the mean square error of any other estimator. Note that the mean-square error is a symmetric matrix. One
symmetric matrix is less than or equal to another if their difference is positive semi-definite. This
definition is somewhat academic at this point because such estimators do not exist except in trivial cases. A
constant estimator has zero mean-square error when $\xi$ is equal to the constant. (The performance is poor at
other values of $\xi$.) Therefore, in order to be uniformly minimum mean-square error, an estimator would have
to have zero mean-square error for every $\xi$; otherwise, a constant estimator would be better for that $\xi$.
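The interplay of bias, variance, and mean square error can be made concrete with a scalar sketch. This is an added illustration, and everything in it is an assumption: the response is taken to be n independent Gaussian samples with mean equal to the parameter and variance sigma squared, and the estimator is a hypothetical "shrinkage" rule, a times the sample mean. The standard scalar identity mse = variance + bias squared is used for the analytic values.

```python
import random

def shrinkage_stats(a, xi, sigma=1.0, n=4):
    """Bias, variance, and mse of the estimator a * (sample mean) at true value xi.

    Uses the scalar identity mse = variance + bias**2.
    """
    bias = (a - 1.0) * xi
    variance = a * a * sigma * sigma / n
    mse = variance + bias * bias
    return bias, variance, mse

def monte_carlo_mse(a, xi, sigma=1.0, n=4, trials=20000, seed=3):
    """Simulation estimate of the same mse, as a check on the formula."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        zbar = sum(rng.gauss(xi, sigma) for _ in range(n)) / n
        total += (a * zbar - xi) ** 2
    return total / trials

for xi in (0.5, 3.0):
    for a in (1.0, 0.8, 0.0):     # unbiased, shrunk, and constant-zero estimators
        bias, variance, mse = shrinkage_stats(a, xi)
        print(f"xi={xi} a={a}: bias={bias:+.3f} var={variance:.3f} mse={mse:.3f}")

mse_check = monte_carlo_mse(0.8, 0.5)
print("simulated mse for a=0.8, xi=0.5:", round(mse_check, 3))
```

The constant estimator (a = 0) has zero variance everywhere, and the shrunk estimator beats the unbiased one at the smaller true value but loses at the larger one, so neither is uniformly better.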

The concept of minimum mean-square error becomes more useful if the class of estimators allowed is
restricted. An estimator is uniformly minimum mean-square error unbiased if it is unbiased and, for every
value of $\xi$, its mean-square error is less than or equal to that of any other unbiased estimator. Such
estimators do not exist for every problem, because the requirement must hold for every value of $\xi$. Estimators
optimum in this sense exist for many problems of interest. The mean-square error and the variance are
identical for unbiased estimators, so such optimal estimators are also called uniformly minimum variance unbiased
estimators. They are also often called simply minimum variance estimators. This term should be regarded as
an abbreviation, because it is not meaningful in itself.

4.2.3 Cramer-Rao Inequality (Efficient Estimators)

The Cramer-Rao inequality is one of the central results used to evaluate the performance of estimators. 
The inequality gives a theoretical limit to the accuracy that is possible, regardless of the estimator used. 

In a sense, the Cramer-Rao inequality gives a measure of the information content of the data. 

Before deriving the Cramer-Rao inequality, let us prove a brief lemma. 

Lemma 4.2-1 Let X and Y be two random N-vectors. Then

$E\{XX^*\} \ge E\{XY^*\}[E\{YY^*\}]^{-1}E\{YX^*\}$ (4.2-5)

assuming that the inverse exists.

Proof The proof is done by completing the square. Let A be any nonrandom
N-by-N matrix. Then

$E\{(X - AY)(X - AY)^*\} \ge 0$ (4.2-6)

because it is a covariance matrix. Expanding

$E\{XX^*\} \ge AE\{YX^*\} + E\{XY^*\}A^* - AE\{YY^*\}A^*$ (4.2-7)

Choose

$A = E\{XY^*\}[E\{YY^*\}]^{-1}$ (4.2-8)

Then

$E\{XX^*\} \ge E\{XY^*\}[E\{YY^*\}]^{-1}E\{YX^*\} + E\{XY^*\}[E\{YY^*\}]^{-1}E\{YX^*\} - E\{XY^*\}[E\{YY^*\}]^{-1}E\{YY^*\}[E\{YY^*\}]^{-1}E\{YX^*\}$ (4.2-9)

or

$E\{XX^*\} \ge E\{XY^*\}[E\{YY^*\}]^{-1}E\{YX^*\}$ (4.2-5)

completing the lemma.

We now seek to find a bound on $E\{(\hat{\xi} - \xi)(\hat{\xi} - \xi)^*|\xi\}$, the mean square error of the estimate.

Theorem 4.2-2 (Cramer-Rao) Assume that the density $p(Z|\xi)$ exists and is
smooth enough to allow the operations below. (See Cramer (1946) for details.)
This assumption proves adequate for most cases of interest to us. Pitman (1979)
discusses some of the cases where $p(Z|\xi)$ is not as smooth as required here.
Then

$E\{(\hat{\xi}(Z) - \xi)(\hat{\xi}(Z) - \xi)^*|\xi\} \ge [I + \nabla_\xi b(\xi)]M(\xi)^{-1}[I + \nabla_\xi b(\xi)]^*$ (4.2-10)

where

$M(\xi) = E\{(\nabla_\xi^* \ln p(Z|\xi))(\nabla_\xi \ln p(Z|\xi))|\xi\}$ (4.2-11)

Proof Let X and Y from Lemma (4.2-1) be $\hat{\xi}(Z) - \xi$ and $\nabla_\xi^* \ln p(Z|\xi)$,
respectively, and let all of the expectations in the lemma be conditioned
on $\xi$. Concentrate first on the term

$E\{XY^*|\xi\} = E\{(\hat{\xi}(Z) - \xi)(\nabla_\xi \ln p(Z|\xi))|\xi\} = \int (\hat{\xi}(Z) - \xi)(\nabla_\xi \ln p(Z|\xi))p(Z|\xi)\,d|Z|$ (4.2-12)

where d|Z| is the volume element in the space Z. Substituting the
relation

$\nabla_\xi \ln p(Z|\xi) = \frac{\nabla_\xi p(Z|\xi)}{p(Z|\xi)}$ (4.2-13)

gives

$E\{XY^*|\xi\} = \int (\hat{\xi}(Z) - \xi)(\nabla_\xi p(Z|\xi))\,d|Z| = \int \hat{\xi}(Z)(\nabla_\xi p(Z|\xi))\,d|Z| - \int \xi(\nabla_\xi p(Z|\xi))\,d|Z|$ (4.2-14)

Now $\hat{\xi}(Z)$ is not a function of $\xi$. Therefore, assuming sufficient smoothness
of $p(Z|\xi)$ as a function of $\xi$, the first term becomes

$\int \hat{\xi}(Z)\nabla_\xi p(Z|\xi)\,d|Z| = \nabla_\xi \int \hat{\xi}(Z)p(Z|\xi)\,d|Z| = \nabla_\xi E\{\hat{\xi}(Z)|\xi\}$ (4.2-15)

Using the definition (Equation (4.2-1)) of the bias, obtain

$\nabla_\xi E\{\hat{\xi}(Z)|\xi\} = \nabla_\xi[\xi + b(\xi)] = I + \nabla_\xi b(\xi)$ (4.2-16)

In the second term of Equation (4.2-14), $\xi$ is not a function of Z, so

$\int \xi\nabla_\xi p(Z|\xi)\,d|Z| = \xi\nabla_\xi \int p(Z|\xi)\,d|Z| = \xi\nabla_\xi 1 = 0$ (4.2-17)

Using Equations (4.2-16) and (4.2-17) in Equation (4.2-14) gives

$E\{XY^*|\xi\} = I + \nabla_\xi b(\xi)$ (4.2-18)

Define the Fisher information matrix

$M(\xi) = E\{YY^*|\xi\} = E\{(\nabla_\xi^* \ln p(Z|\xi))(\nabla_\xi \ln p(Z|\xi))|\xi\}$ (4.2-19)

Then by Lemma (4.2-1)

$E\{(\hat{\xi}(Z) - \xi)(\hat{\xi}(Z) - \xi)^*|\xi\} \ge [I + \nabla_\xi b(\xi)]M(\xi)^{-1}[I + \nabla_\xi b(\xi)]^*$ (4.2-10)

which is the desired result.


Equation (4.2-10) is the Cramer-Rao inequality. Its specialization to unbiased estimators is of particular
interest. For an unbiased estimator, $b(\xi)$ is zero so

$E\{(\hat{\xi}(Z) - \xi)(\hat{\xi}(Z) - \xi)^*|\xi\} \ge M(\xi)^{-1}$ (4.2-20)

This gives us a lower bound, as a function of $\xi$, on the achievable variance of any unbiased estimator. An
unbiased estimator which attains the equality in Equation (4.2-20) is called an efficient estimator. No
estimator can achieve a lower variance than an efficient estimator except by introducing a bias in the
estimates. In this sense, an efficient estimator makes the most use of the information available in the data.


The above development gives no guarantee that an efficient estimator exists for every problem. When an
efficient estimator does exist, it is also a uniformly minimum variance unbiased estimator. It is much easier
to check for equality in Equation (4.2-20) than to directly prove that no other unbiased estimator has a
smaller variance than a given estimator. The Cramer-Rao inequality is therefore useful as a sufficient (but
not necessary) check that an estimator is uniformly minimum variance unbiased.
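
As a concrete illustration of checking for equality in Equation (4.2-20), the following Python sketch uses a
standard textbook case that is not worked in the text: estimating the mean of N independent Gaussian samples
with known standard deviation. Under that assumed model the information is N/sigma^2 and the sample mean has
variance sigma^2/N, so the bound should be attained. The function names, the parameter values, and the Monte
Carlo check are illustrative assumptions only.

```python
import random
import statistics


def cramer_rao_bound(sigma: float, n: int) -> float:
    """Bound M(xi)^-1 for the mean of n i.i.d. N(xi, sigma^2) samples (assumed model)."""
    fisher_information = n / sigma**2
    return 1.0 / fisher_information


def sample_mean_mse(xi: float, sigma: float, n: int, trials: int, seed: int = 0) -> float:
    """Monte Carlo mean-square error of the sample mean about the true value xi."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        estimate = statistics.fmean(rng.gauss(xi, sigma) for _ in range(n))
        total += (estimate - xi) ** 2
    return total / trials


bound = cramer_rao_bound(sigma=2.0, n=10)
mse = sample_mean_mse(xi=1.0, sigma=2.0, n=10, trials=20000)
print(f"Cramer-Rao bound = {bound:.4f}, simulated mse of sample mean = {mse:.4f}")
```

The simulated mean-square error should agree with the bound to within sampling error, which is the numerical
counterpart of verifying equality in Equation (4.2-20).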


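A useful alternative expression for the information matrix $M$ can be obtained if $p(Z \mid \xi)$ is sufficiently
smooth. Applying Equation (4.2-13) to the definition of $M$ (Equation (4.2-19)) gives

$$ M(\xi) = E\left\{ \frac{(\nabla_\xi^* p(Z \mid \xi))(\nabla_\xi p(Z \mid \xi))}{p(Z \mid \xi)^2} \,\middle|\, \xi \right\} \tag{4.2-21} $$

Then examine

$$ \begin{aligned}
E\{\nabla_\xi^2 \ln p(Z \mid \xi) \mid \xi\}
 &= E\left\{ \nabla_\xi^* \left[ \frac{\nabla_\xi p(Z \mid \xi)}{p(Z \mid \xi)} \right] \,\middle|\, \xi \right\} \\
 &= E\left\{ \frac{\nabla_\xi^2 p(Z \mid \xi)}{p(Z \mid \xi)} \,\middle|\, \xi \right\}
  - E\left\{ \frac{(\nabla_\xi^* p(Z \mid \xi))(\nabla_\xi p(Z \mid \xi))}{p(Z \mid \xi)^2} \,\middle|\, \xi \right\}
\end{aligned} \tag{4.2-22} $$

The expectation in the second term is equal to $M(\xi)$, as shown in Equation (4.2-21). Evaluate the first term as

$$ \begin{aligned}
E\left\{ \frac{\nabla_\xi^2 p(Z \mid \xi)}{p(Z \mid \xi)} \,\middle|\, \xi \right\}
 &= \int \frac{\nabla_\xi^2 p(Z \mid \xi)}{p(Z \mid \xi)}\, p(Z \mid \xi)\, d|Z| = \int \nabla_\xi^2 p(Z \mid \xi)\, d|Z| \\
 &= \nabla_\xi^2 \int p(Z \mid \xi)\, d|Z| = \nabla_\xi^2 1 = 0
\end{aligned} \tag{4.2-23} $$

Thus an alternate expression for the information matrix is

$$ M(\xi) = -E\{\nabla_\xi^2 \ln p(Z \mid \xi) \mid \xi\} \tag{4.2-24} $$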
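
The equivalence of Equations (4.2-19) and (4.2-24) can be checked numerically for a simple scalar density.
The sketch below assumes an exponential density p(z | xi) = xi exp(-xi z), chosen only for illustration (it
does not appear in the text), and evaluates both expectations by midpoint-rule integration with
finite-difference derivatives. The step sizes and integration limit are arbitrary assumptions.

```python
import math
from typing import Tuple


def log_density(z: float, xi: float) -> float:
    """ln p(z | xi) for the assumed exponential density p = xi * exp(-xi * z), z >= 0."""
    return math.log(xi) - xi * z


def fisher_information_two_ways(xi: float, z_max: float = 60.0, steps: int = 60000,
                                h: float = 1e-3) -> Tuple[float, float]:
    """Return (E{score^2}, -E{second derivative of ln p}) by midpoint integration over z."""
    dz = z_max / steps
    outer_product_form = 0.0   # Equation (4.2-19), scalar case
    curvature_form = 0.0       # Equation (4.2-24), scalar case
    for k in range(steps):
        z = (k + 0.5) * dz
        center = log_density(z, xi)
        plus = log_density(z, xi + h)
        minus = log_density(z, xi - h)
        score = (plus - minus) / (2.0 * h)                # first derivative of ln p in xi
        curvature = (plus - 2.0 * center + minus) / h**2  # second derivative of ln p in xi
        weight = math.exp(center) * dz                    # p(z | xi) dz
        outer_product_form += score**2 * weight
        curvature_form -= curvature * weight
    return outer_product_form, curvature_form


print(fisher_information_two_ways(2.0))
```

For this density both forms should come out close to 1/xi^2, the analytic information for the assumed model.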


4.2.4 Bayesian Optimal Estimators 

The optimality conditions of the previous sections have been quite restrictive in that they must hold
simultaneously for every possible value of $\xi$. Thus for some problems, no estimators exist that are optimal
by these criteria. The Bayesian approach avoids this difficulty by using a single, overall, optimality
criterion which averages the errors made for different values of $\xi$. With this approach, an optimal estimator
may be worse than a nonoptimal one for specific values of $\xi$, but the overall averaged performance of the
Bayesian optimal estimator will be better.

The Bayesian approach requires that a loss function (risk function, optimality criterion) be defined as a
function of the true value $\xi$ and the estimate $\hat{\xi}$. The most common loss function is a weighted square error

$$ J(\xi, \hat{\xi}) = (\xi - \hat{\xi})^* R\, (\xi - \hat{\xi}) \tag{4.2-25} $$

where $R$ is a weighting matrix. An estimator is considered optimal in the Bayesian sense if it minimizes the
a posteriori expected value of the loss function:

$$ \begin{aligned}
E\{J(\xi, \hat{\xi}(Z)) \mid Z\} &= \int J(\xi, \hat{\xi}(Z))\, p(\xi \mid Z)\, d|\xi| \\
 &= \frac{\int J(\xi, \hat{\xi}(Z))\, p(Z \mid \xi)\, p(\xi)\, d|\xi|}{p(Z)}
\end{aligned} \tag{4.2-26} $$

An optimal estimator must minimize this expected value for each $Z$. Since $p(Z)$ is not a function of $\hat{\xi}$, it
does not affect the minimization of Equation (4.2-26) with respect to $\hat{\xi}$. Thus a Bayesian optimal estimator
also minimizes the expression

$$ \int J(\xi, \hat{\xi}(Z))\, p(Z \mid \xi)\, p(\xi)\, d|\xi| \tag{4.2-27} $$

Note that $p(\xi)$, the probability density of $\xi$, is required in order to define Bayesian optimality. For this
purpose, $p(\xi)$ can be considered simply as a weighting that is part of the loss function, if it cannot
appropriately be interpreted as a true probability density or an information function (Section 4.1).

4.2.5 Asymptotic Properties 

Asymptotic properties concern the characteristics of the estimates as the amount of data used increases
toward infinity. The amount of data used can increase either by repeating experiments or by increasing the
time slice analyzed in a single experiment. (The latter is pertinent only for dynamic systems.) Since only
a finite amount of data can be used in practice, it is not immediately obvious why there is any interest in
asymptotic properties.

This interest arises primarily from considerations of simplicity. It is often simpler to compute asymptotic
properties and to construct asymptotically optimal estimators than to do so for finite amounts of data.
We can then use the asymptotic results as good approximations to the more difficult finite data results if the
amount of data used is large enough. The finite data definitions of unbiased estimators and efficient
estimators have direct asymptotic analogues of interest. An estimator is asymptotically unbiased if the bias goes
to zero for all $\xi$ as the amount of data goes to infinity. An estimator is asymptotically efficient if it is
asymptotically unbiased and if

$$ M(\xi)\, E\{(\hat{\xi} - \xi)(\hat{\xi} - \xi)^* \mid \xi\} \to I \tag{4.2-28} $$

as the amount of data approaches infinity. Equation (4.2-28) is an asymptotic expression for equality in
Equation (4.2-20).

One important asymptotic property has no finite data analogue. This is the notion of consistency. An
estimator is consistent if $\hat{\xi} \to \xi$ as the amount of data goes to infinity. For strong consistency, the
convergence is required to be with probability one. Note that strong consistency is defined in terms of the
convergence of individual realizations of the estimates, unlike the bias, variance, and other properties which
are defined in terms of average properties (expected values).

Consistency is a stronger property than asymptotic unbiasedness; that is, all consistent estimators are
asymptotically unbiased. This is a basic convergence result: that convergence with probability one implies
convergence in distribution (and thus, specifically, convergence in mean). We refer the reader to Lipster and
Shiryayev (1977), Cramer (1946), Goodwin and Payne (1977), Zacks (1971), and Mehra and Lainiotis (1976) for
this and other results on consistency. Results on consistency tend to involve careful mathematical arguments
relating to different types of convergence.

We will not delve deeply into asymptotic properties such as consistency in this book. We generally feel 
that asymptotic properties, although theoretically intriguing, should be played down in practical application. 
Application of infinite-time results to finite data is an approximation, one that is sometimes useful, but 
sometimes gives completely misleading conclusions (see Section 8.2). The inconsistency should be evident in 
books that spend copious time arguing fine points of distinction between different kinds of convergence and 
then pass off application to finite data with cursory allusions to using large data samples. 

Although we de-emphasize the "rigorous" treatment of asymptotic properties, some asymptotic results are
crucial to practical implementation. This is not because of any improved rigor of the asymptotic results, but
because the asymptotic results are often simpler, sometimes enough simpler to make the critical difference in
usability. This is our primary use of asymptotic results: as simplifying approximations to the finite-time
results. Introduction of complicated convergence arguments hides this essential role. The approximations work
well in many cases and, as with most approximations, fail in some situations. Our emphasis in asymptotic
results will center on justifying when they are appropriate and understanding when they fail.


4.3 COMMON ESTIMATORS 

This section will define some of the commonly used general types of estimators. The list is far from
complete; we mention only those estimators that will be used in this book. We also present a few general
results characterizing the estimators.

4.3.1 A posteriori Expected Value

One of the most natural estimates is the a posteriori expected value. This estimate is defined as the
mean of the posterior distribution.

$$ \hat{\xi}(Z) = E\{\xi \mid Z\} = \int \xi\, p(\xi \mid Z)\, d|\xi|
 = \frac{\int \xi\, p(Z \mid \xi)\, p(\xi)\, d|\xi|}{\int p(Z \mid \xi)\, p(\xi)\, d|\xi|} \tag{4.3-1} $$

This estimator requires that $p(\xi)$, the prior density of $\xi$, be known.

4.3.2 Bayesian Minimum Risk

Bayesian optimality was defined in Section 4.2.4. Any estimator which minimizes the a posteriori
expected value of the loss function is a Bayesian minimum risk estimator. (In general, there can be more than
one such estimator for a given problem.) The prior distribution of $\xi$ must be known to define Bayesian
estimators.

Theorem 4.3-1 The a posteriori expected value (Section 4.3.1) is the unique
Bayesian minimum risk estimator for the loss function

$$ J(\xi, \hat{\xi}) = (\xi - \hat{\xi})^* R\, (\xi - \hat{\xi}) \tag{4.3-2} $$

where $R$ is any positive definite symmetric matrix.

Proof A Bayesian minimum risk estimator must minimize

$$ E\{J \mid Z\} = E\{(\xi - \hat{\xi}(Z))^* R\, (\xi - \hat{\xi}(Z)) \mid Z\} \tag{4.3-3} $$

Since $R$ is symmetric, the gradient of this function is

$$ \nabla_{\hat{\xi}} E\{J \mid Z\} = -2\, E\{R(\xi - \hat{\xi}(Z)) \mid Z\}^* \tag{4.3-4} $$

Setting this expression to zero gives

$$ 0 = R\, E\{\xi - \hat{\xi}(Z) \mid Z\} = R\, [E\{\xi \mid Z\} - \hat{\xi}(Z)] \tag{4.3-5} $$

Therefore, because $R$ is nonsingular,

$$ \hat{\xi}(Z) = E\{\xi \mid Z\} \tag{4.3-6} $$

is the unique stationary point of $E\{J \mid Z\}$. The second gradient is

$$ \nabla_{\hat{\xi}}^2 E\{J \mid Z\} = 2R > 0 \tag{4.3-7} $$

so the stationary point is the global minimum.
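
Theorem 4.3-1 can be illustrated with a small discrete posterior. In the sketch below, the three-point
posterior, the scalar weighting r, and the function names are all hypothetical choices made for
illustration. The posterior is deliberately nonsymmetric, since the theorem does not require symmetry.

```python
from typing import Sequence


def posterior_mean(values: Sequence[float], probs: Sequence[float]) -> float:
    """A posteriori expected value for a discrete posterior (scalar parameter)."""
    return sum(v * p for v, p in zip(values, probs))


def expected_loss(estimate: float, values: Sequence[float], probs: Sequence[float],
                  r: float = 1.0) -> float:
    """A posteriori expected value of the quadratic loss r * (xi - estimate)^2."""
    return sum(p * r * (v - estimate) ** 2 for v, p in zip(values, probs))


# Hypothetical nonsymmetric posterior over three parameter values.
values = [0.0, 1.0, 4.0]
probs = [0.5, 0.3, 0.2]

mean = posterior_mean(values, probs)
candidates = [k / 100.0 for k in range(0, 401)]
best = min(candidates, key=lambda e: expected_loss(e, values, probs))
print(f"posterior mean = {mean:.2f}, best candidate on grid = {best:.2f}")
```

A brute-force search over candidate estimates lands on the posterior mean, as the theorem predicts for any
positive weighting.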

Theorem (4.3-1) applies only for the quadratic loss function of Equation (4.3-2). The following very
similar theorem applies to a much broader class of loss functions, but requires the assumption that $p(\xi \mid Z)$ is
symmetric about its mean. Theorem (4.3-1) makes no assumptions about $p(\xi \mid Z)$ except that it has finite mean
and variance.


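Theorem 4.3-2 Assume that $p(\xi \mid Z)$ is symmetric about its mean for each $Z$; i.e.,

$$ p_{\xi|Z}(\hat{\xi}(Z) + \xi \mid Z) = p_{\xi|Z}(\hat{\xi}(Z) - \xi \mid Z) \tag{4.3-8} $$

where $\hat{\xi}(Z)$ is the expected value of $\xi$ given $Z$. Then the a posteriori expected value is the
unique Bayesian minimum risk estimator for any loss function of the form

$$ J(\xi, \hat{\xi}) = J_1(\xi - \hat{\xi}) \tag{4.3-9} $$

where $J_1$ is symmetric about 0 and is strictly convex.

Proof We need to demonstrate that

$$ D(a) = E\{J(\xi, \hat{\xi}(Z) + a) \mid Z\} - E\{J(\xi, \hat{\xi}(Z)) \mid Z\} > 0 \tag{4.3-10} $$

for all $a \neq 0$. Using Equation (4.3-9) and the definition of expectation

$$ D(a) = \int p(\xi \mid Z)\, [J_1(\xi - \hat{\xi}(Z) - a) - J_1(\xi - \hat{\xi}(Z))]\, d|\xi| \tag{4.3-11} $$

Because of the symmetry of $p(\xi \mid Z)$, we can replace the integral in Equation (4.3-11)
by an integral over the region

$$ S = \{\xi : (\xi - \hat{\xi}(Z),\, a) > 0\} \tag{4.3-12} $$

giving

$$ \begin{aligned}
D(a) = \int_S p(\xi \mid Z)\, [ & J_1(\xi - \hat{\xi}(Z) - a) + J_1(\hat{\xi}(Z) - \xi - a) \\
 & - J_1(\xi - \hat{\xi}(Z)) - J_1(\hat{\xi}(Z) - \xi) ]\, d|\xi|
\end{aligned} \tag{4.3-13} $$

Using the symmetry of $J_1$ gives

$$ \begin{aligned}
D(a) = \int_S p(\xi \mid Z)\, [ & J_1(\xi - \hat{\xi}(Z) - a) + J_1(\xi - \hat{\xi}(Z) + a) \\
 & - 2 J_1(\xi - \hat{\xi}(Z)) ]\, d|\xi|
\end{aligned} \tag{4.3-14} $$

By the strict convexity of $J_1$,

$$ J_1(\xi - \hat{\xi}(Z) - a) + J_1(\xi - \hat{\xi}(Z) + a) > 2 J_1(\xi - \hat{\xi}(Z)) \tag{4.3-15} $$

for all $a \neq 0$. Therefore $D(a) > 0$ for all $a \neq 0$ as we desired to show.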

Note that, if $J_1$ is convex, but not strictly convex, Theorem (4.3-2) still holds except for the
uniqueness. Theorems (4.3-1) and (4.3-2) are two of the basic results in the theory of estimation. They motivate
the use of a posteriori expected value estimators.


4.3.3 Maximum a posteriori Probability 

The maximum a posteriori probability (MAP) estimate is defined as the mode of the posterior distribution
(i.e., the value of $\xi$ which maximizes the posterior density function). If the distribution is not unimodal,
the MAP estimate may not be unique. As with the previously discussed estimators, the prior distribution of
$\xi$ must be known in order to define the MAP estimate.


The MAP estimate is equal to the a posteriori expected value (and thus to the Bayesian minimum risk for
loss functions meeting the conditions of Theorem (4.3-2)) if the posterior distribution is symmetric about its
mean and unimodal, since the mode and the mean of such distributions are equal. For nonsymmetric
distributions, this equality does not hold.

The MAP estimate is generally much easier to calculate than the a posteriori expected value. The
a posteriori expected value is (from Equation (4.3-1))

$$ \hat{\xi}(Z) = \frac{\int \xi\, p(Z \mid \xi)\, p(\xi)\, d|\xi|}{\int p(Z \mid \xi)\, p(\xi)\, d|\xi|} \tag{4.3-16} $$

This calculation requires the evaluation of two integrals over $\Xi$. The MAP estimate requires the
maximization of

$$ p(\xi \mid Z) = \frac{p(Z \mid \xi)\, p(\xi)}{p(Z)} \tag{4.3-17} $$

with respect to $\xi$. The $p(Z)$ is not a function of $\xi$, so the MAP estimate can also be obtained by

$$ \hat{\xi}(Z) = \arg\max_{\xi}\; p(Z \mid \xi)\, p(\xi) \tag{4.3-18} $$

The "arg max" notation indicates that $\hat{\xi}$ is the value of $\xi$ that maximizes the density function $p(Z \mid \xi)\, p(\xi)$.
The maximization in Equation (4.3-18) is generally much simpler than the integrations in Equation (4.3-16).
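
The difference between the two calculations can be seen in a scalar sketch. The model below is an assumption
made only for illustration: a unit-variance Gaussian measurement z of the parameter and an exponential prior
on xi >= 0, which gives a nonsymmetric posterior. The grid limits and function names are likewise
hypothetical. The MAP estimate needs one maximization; the posterior mean needs two numerical integrals.

```python
import math
from typing import Tuple


def unnormalized_posterior(xi: float, z: float) -> float:
    """p(z | xi) p(xi) up to constant factors (assumed Gaussian likelihood, exponential prior)."""
    if xi < 0.0:
        return 0.0
    return math.exp(-0.5 * (z - xi) ** 2 - xi)


def map_and_posterior_mean(z: float, xi_max: float = 12.0, steps: int = 12000) -> Tuple[float, float]:
    """Return (MAP estimate, a posteriori expected value) computed on a midpoint grid."""
    dxi = xi_max / steps
    grid = [(k + 0.5) * dxi for k in range(steps)]
    weights = [unnormalized_posterior(xi, z) for xi in grid]
    # MAP: a single maximization, as in Equation (4.3-18).
    map_estimate = grid[max(range(steps), key=weights.__getitem__)]
    # Posterior mean: ratio of two integrals, as in Equation (4.3-16).
    numerator = sum(xi * w for xi, w in zip(grid, weights)) * dxi
    denominator = sum(weights) * dxi
    return map_estimate, numerator / denominator


xi_map, xi_mean = map_and_posterior_mean(z=3.0)
print(f"MAP = {xi_map:.3f}, posterior mean = {xi_mean:.3f}")
```

Because the assumed posterior is truncated at zero, its mean lies slightly above its mode, so the two
estimates differ.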

4.3.4 Maximum Likelihood

The previous estimators have all required that the prior distribution of $\xi$ be known. When $\xi$ is not
random or when its distribution is not known, there are far fewer reasonable estimators to choose from.
Maximum likelihood estimators are the only type that we will discuss.

The maximum likelihood estimate is defined as the value of $\xi$ which maximizes the likelihood functional
$p(Z \mid \xi)$; in other words,

$$ \hat{\xi}(Z) = \arg\max_{\xi}\; p(Z \mid \xi) \tag{4.3-19} $$

The maximum likelihood estimator is closely related to the MAP estimator. The MAP estimator maximizes $p(\xi \mid Z)$;
heuristically we could say that the MAP estimator selects the most probable value of $\xi$, given the data. The
maximum likelihood estimator maximizes $p(Z \mid \xi)$; i.e., it selects the value of $\xi$ which makes the observed data
most plausible. Although these may sound like two statements of the same concept, there are crucial
differences. One of the most central differences is that maximum likelihood is defined whether or not the prior
distribution of $\xi$ is known.

Comparing Equation (4.3-18) with Equation (4.3-19) reveals that the maximum likelihood estimate is
identical to the MAP estimate if $p(\xi)$ is a constant. If the parameter space $\Xi$ has finite size, this implies
that $p(\xi)$ is the uniform distribution. For infinite $\Xi$, such as $R^n$, there are no uniform distributions, so
a strict equivalence cannot be established. If we relax our definition of a probability distribution to allow
arbitrary density functions which need not integrate to 1 (sometimes called generalized probabilities), the
equivalence can be established for any $\Xi$. Alternately, the uniform distribution for infinite size $\Xi$ can be
viewed as a limiting case of distributions with variance going to infinity (less and less prior certainty about
the value of $\xi$).

The maximum likelihood estimator places no preference on any value of $\xi$ over any other value of $\xi$; the
estimate is solely a function of the data. The MAP estimate, on the other hand, considers both the data and
the preference defined by the prior distribution.
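
The limiting-case view of the uniform prior can be illustrated with a scalar sketch. The model here is an
assumption for illustration only: a single measurement z of xi with unit-variance Gaussian noise, and a
Gaussian prior with a chosen mean and variance. Setting the derivative of ln[p(z | xi) p(xi)] to zero gives
the closed form used below; the function names are hypothetical.

```python
def map_estimate(z: float, prior_mean: float, prior_variance: float) -> float:
    """MAP estimate for z ~ N(xi, 1) with prior xi ~ N(prior_mean, prior_variance).

    Stationary point of -(z - xi)^2 / 2 - (xi - prior_mean)^2 / (2 * prior_variance).
    """
    return (z + prior_mean / prior_variance) / (1.0 + 1.0 / prior_variance)


def ml_estimate(z: float) -> float:
    """Maximum likelihood estimate for the same measurement model: it ignores the prior."""
    return z


z = 2.0
for prior_variance in (1.0, 100.0, 1.0e4, 1.0e8):
    print(prior_variance, map_estimate(z, prior_mean=0.0, prior_variance=prior_variance))
print("ML:", ml_estimate(z))
```

As the prior variance grows, the MAP estimate moves from a compromise between the prior mean and the
measurement toward the maximum likelihood estimate.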

Maximum likelihood estimators have many interesting properties, which we will cover later. One of the 
most basic is given by the following theorem: 

Theorem 4.3-3 If an efficient estimator exists for a problem, that estimator 
is a maximum likelihood estimator. 

Proof (This proof requires the use of the full notation for probability
density functions to avoid confusion.) Assume that $\hat{\xi}(Z)$ is any efficient
estimator. An estimator will be efficient if and only if equality holds
in lemma (4.2-1). Equality holds if and only if $X = AY$ in Equation (4.2-6).
Substituting for $A$ from Equation (4.2-8) gives

$$ X = E\{XY^*\}\, E\{YY^*\}^{-1}\, Y \tag{4.3-20} $$

Substituting for $X$ and $Y$ as in the proof of the Cramer-Rao bound, and using
Equations (4.2-18) and (4.2-19) gives

$$ \hat{\xi}(Z) - \xi = [I + \nabla_\xi b(\xi)]\, M(\xi)^{-1}\, \nabla_\xi^* \ln p_{Z|\xi}(Z \mid \xi) \tag{4.3-21} $$

Efficient estimators must be unbiased, so $b(\xi)$ is zero and

$$ \hat{\xi}(Z) - \xi = M(\xi)^{-1}\, \nabla_\xi^* \ln p_{Z|\xi}(Z \mid \xi) \tag{4.3-22} $$

For an efficient estimator, Equation (4.3-22) must hold for all values of $Z$
and $\xi$. In particular, for each $Z$, the equation must hold for $\xi = \hat{\xi}(Z)$.
The left-hand side is then zero, so we must have

$$ \nabla_\xi^* \ln p_{Z|\xi}(Z \mid \hat{\xi}(Z)) = 0 \tag{4.3-23} $$

The estimate is thus at a stationary point of the likelihood functional.
Taking the gradient of Equation (4.3-22)

$$ -I = M(\xi)^{-1}\, \nabla_\xi^2 \ln p_{Z|\xi}(Z \mid \xi)
 - M(\xi)^{-1}\, [\nabla_\xi M(\xi)]\, M(\xi)^{-1}\, \nabla_\xi^* \ln p_{Z|\xi}(Z \mid \xi) \tag{4.3-24} $$

Evaluating this at $\xi = \hat{\xi}(Z)$, and using Equation (4.3-23) gives

$$ -I = M(\hat{\xi}(Z))^{-1}\, \nabla_\xi^2 \ln p_{Z|\xi}(Z \mid \hat{\xi}(Z)) \tag{4.3-25} $$

Since $M$ is positive definite, the stationary point is a local maximum.
In fact, it is the only local maximum, because a local maximum at any point
other than $\xi = \hat{\xi}(Z)$ would violate Equation (4.3-22). The requirement for
$\int p_{Z|\xi}(Z \mid \xi)\, d|Z|$ to be finite implies that $p_{Z|\xi}(Z \mid \xi) \to 0$ as $|Z| \to \infty$, so
that the local maximum will be a global maximum. Therefore $\hat{\xi}(Z)$ is a
maximum likelihood estimator.

Corollary All efficient estimators for a problem are equivalent (i.e., if
an efficient estimator exists, it is unique).

This theorem and its corollary are not as useful as they might seem at first glance, because efficient
estimators do not exist for many problems. Therefore, it is not always true that a maximum likelihood
estimator is efficient. The theorem does apply to some simple problems, however, and motivates the more widely
applicable asymptotic results which will be discussed later.

Maximum likelihood estimates have the following natural invariance property: let $\hat{\xi}$ be the maximum
likelihood estimate of $\xi$; then $f(\hat{\xi})$ is the maximum likelihood estimate of $f(\xi)$ for any function $f$. The
proof of this statement is trivial if $f$ is invertible. Let $L_\xi(\xi, Z)$ be the likelihood functional of $\xi$ for
a given $Z$. Define

$$ x = f(\xi) \tag{4.3-26} $$

Then the likelihood function of $x$ is

$$ L_x(x, Z) = L_\xi(f^{-1}(x), Z) \tag{4.3-27} $$

This is the crucial equation. By definition, the left-hand side is maximized by $x = \hat{x}$, and the right-hand
side is maximized by $f^{-1}(x) = \hat{\xi}$. Therefore

$$ \hat{x} = f(\hat{\xi}) \tag{4.3-28} $$

The extension to noninvertible $f$ is straightforward: simply realize that $f^{-1}(x)$ is a set of values, rather
than a single value. The same argument then still holds, regarding $L_x(x, Z)$ as a one-to-many function
(set-valued function).
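
The invariance property can be checked numerically. The sketch below assumes a model that is not taken
from the text: independent zero-mean Gaussian samples with unknown variance, for which the maximum likelihood
estimate of the variance is the mean of the squared samples (a standard result, assumed here). Maximizing the
likelihood directly over the standard deviation on a grid should then return the square root of that
estimate. The sample values and grid are hypothetical.

```python
import math
from typing import Sequence


def log_likelihood_sigma(sigma: float, samples: Sequence[float]) -> float:
    """Log likelihood (up to a constant) of zero-mean Gaussian samples, parameterized by sigma."""
    n = len(samples)
    return -n * math.log(sigma) - sum(z * z for z in samples) / (2.0 * sigma**2)


# Hypothetical data.
samples = [0.5, -1.2, 2.0, -0.3, 1.1]

# ML estimate of the variance for the assumed model.
variance_ml = sum(z * z for z in samples) / len(samples)

# Direct maximization over x = f(xi) = sqrt(variance) on a grid with spacing 0.001.
grid = [0.001 * k for k in range(1, 5001)]
sigma_ml = max(grid, key=lambda s: log_likelihood_sigma(s, samples))

print(f"sqrt(variance_ml) = {math.sqrt(variance_ml):.4f}, direct sigma_ml = {sigma_ml:.4f}")
```

The two numbers agree to within the grid spacing, which is the content of Equation (4.3-28) for the
invertible function f(xi) = sqrt(xi) on positive xi.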

Finally, let us emphasize that, although maximum likelihood estimates are formally identical to MAP
estimates with uniform prior distributions, there is a basic theoretical difference in interpretation. Maximum
likelihood makes no statements about distributions of $\xi$, prior or posterior. Stating that a parameter has a
uniform prior distribution is drastically different from saying that we have no information about the
parameter. Several classic "paradoxes" of probability theory resulted from ignoring this difference. The
paradoxes arise in transformations of variable. Let a scalar $\xi$ have a uniform prior distribution, and let $f$
be any continuous invertible function. Then, by Equation (3.4-1), $x = f(\xi)$ has the density function

$$ p_x(x) = p_\xi(f^{-1}(x)) \left| \frac{d f^{-1}(x)}{dx} \right| \tag{4.3-29} $$

which is not a uniform distribution on $x$ (unless $f$ is linear). Thus if we say that there is no prior
information (uniform distribution) about $\xi$, then this gives us prior information (nonuniform distribution)
about $x$, and vice versa. This apparent paradox results from equating a uniform distribution with the idea
of "no information."
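
A short numerical sketch makes the effect of Equation (4.3-29) concrete. The specific choices are
assumptions for illustration: xi is taken as uniform on (0, 1) and the transformation is x = xi^2, so that
the inverse is sqrt(x). If x were also "uninformed" (uniform), the probability that x lies below 0.25 would
be 0.25; the transformed density gives a different answer.

```python
import math


def density_of_x(x: float) -> float:
    """Equation (4.3-29) for xi uniform on (0, 1) and x = xi**2 (assumed example)."""
    if not 0.0 < x < 1.0:
        return 0.0
    prior_density_at_inverse = 1.0                      # p_xi(f^-1(x)) for the uniform prior
    inverse_derivative = 1.0 / (2.0 * math.sqrt(x))     # d/dx of f^-1(x) = sqrt(x)
    return prior_density_at_inverse * abs(inverse_derivative)


def probability_x_below(limit: float, steps: int = 200000) -> float:
    """Midpoint-rule integral of the density of x from 0 to limit."""
    dx = limit / steps
    return sum(density_of_x((k + 0.5) * dx) for k in range(steps)) * dx


print(f"P(x < 0.25) under the transformed density: {probability_x_below(0.25):.3f}")
print("P(x < 0.25) if x were uniform on (0, 1):    0.250")
```

The integral comes out close to 0.5, which is simply the probability that xi lies below 0.5: a uniform
statement about xi is a decidedly nonuniform statement about x.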

Therefore, although we can formally derive the equations for maximum likelihood estimators by substituting
uniform prior distributions in the equations for MAP estimators, we must avoid misinterpretations. Fisher
(1921, p. 326) discussed this subject at length:

There would be no need to emphasize the baseless character of the assumptions
made under the titles of inverse probability and BAYES' Theorem in view of
the decisive criticism to which they have been exposed.... I must indeed plead
guilty in my original statement of the Method of Maximum Likelihood (9) to
having based my argument upon the principle of inverse probability; in the
same paper, it is true, I emphasized the fact that such inverse probabilities
were relative only. That is to say, that while one might speak of one value
of p as having an inverse probability three times that of another value of p,
we might on no account introduce the differential element dp, so as to be
able to say that it was three times as probable that p should lie in one
rather than the other of two equal elements. Upon consideration, therefore, I
perceive that the word probability is wrongly used in such a connection:
probability is a ratio of frequencies, and about the frequencies of such values
we can know nothing whatever. We must return to the actual fact that one value
of p, of the frequency of which we know nothing, would yield the observed
result three times as frequently as would another value of p. If we need a
word to characterize this relative property of different values of p, I suggest
that we may speak without confusion of the likelihood of one value of p
being thrice the likelihood of another, bearing always in mind that likelihood
is not here used loosely as a synonym of probability, but simply to
express the relative frequencies with which such values of the hypothetical
quantity p would in fact yield the observed sample.


CHAPTER 5


5.0 THE STATIC ESTIMATION PROBLEM 

This chapter begins the application of the general types of estimators defined in Chapter 4 to
specific problems. The problems discussed in this chapter are static estimation problems; that is, problems
where time is not explicitly involved. Subsequent chapters on dynamic systems draw heavily on these static
results. Our treatment is far from complete; it is easy to spend an entire book on static estimation alone
(Sorenson, 1980). The material presented here was selected largely on the basis of relevance to dynamic
systems.

We concentrate primarily on linear systems with additive Gaussian noise, where there are simple,
closed-form solutions. We also cover nonlinear systems with additive Gaussian noise, which will prove of major
importance in Chapter 8. Non-Gaussian and nonadditive noise are mentioned only briefly, except for the special
problem of estimation of variance.

We will initially treat nonsingular problems, where we assume that all relevant distributions have density
functions. The understanding and handling of singular and ill-conditioned problems then receive special
attention. Singularities and ill-conditioning are crucial issues in practical application, but are
insufficiently treated in much of the current literature. We also discuss partitioning of estimation problems, an
important technique for simplifying the computational task and treating some singularities.

The general form of a static system model is

$$ Z = Z(\xi, U, \omega) \tag{5.0-1} $$

We apply a known specific input $U$ (or a set of inputs) to the system, and measure the response $Z$. The
vector $\omega$ is a random vector contaminating the measured system response. We desire to estimate the value
of $\xi$.

The estimators discussed in Chapter 4 require knowledge of the conditional distribution of $Z$ given $\xi$
and $U$. We assume, for now, that the distribution is nonsingular, with density $p(Z \mid \xi, U)$. If $\xi$ is
considered random, you must know the joint density $p(Z, \xi \mid U)$. In some simple cases, these densities might be
given directly, in which case Equation (5.0-1) is not necessary; the estimators of Chapter 4 apply directly.
More typically, $p(Z \mid \xi, U)$ is a complicated density which is derived from Equation (5.0-1) and $p(\omega \mid \xi, U)$. It is
often reasonable to assume quite simple distributions for $\omega$, independent of $\xi$ and $U$. In this chapter, we
will look at several specific cases.


5.1 LINEAR SYSTEMS WITH ADDITIVE GAUSSIAN NOISE 

The simplest and most classic results are obtained for linear static systems with additive Gaussian noise.
The system equations are assumed to have the form

$$ Z = C(U)\xi + D(U) + G(U)\omega \tag{5.1-1} $$

For any particular $U$, $Z$ is a linear combination of $\xi$, $\omega$, and a constant vector. Note that there are no
assumptions about linearity with respect to $U$; the functions $C$, $D$, and $G$ can be arbitrarily complicated.
Throughout this section, we omit the explicit dependence on $U$ from the notation. Similarly, all
distributions and expectations are implicitly understood to be conditioned on $U$.

The random noise vector $\omega$ is assumed to be Gaussian and independent of $\xi$. By convention, we will
define the mean of $\omega$ to be 0, and the covariance to be identity. This convention does not limit the
generality of Equation (5.1-1), for if $\omega$ has a mean $m$ and a finite covariance $FF^*$, we can define $G_2 = GF$
and $D_2 = D + Gm$ to obtain

$$ Z = C\xi + D_2 + G_2\omega_2 \tag{5.1-2} $$

where $\omega_2$ has zero mean and identity covariance.

When $\xi$ is considered as random, we will assume that its marginal (prior) distribution is Gaussian with
mean $m_\xi$ and covariance $P$.

$$ p(\xi) = |2\pi P|^{-1/2} \exp\left\{ -\tfrac{1}{2} (\xi - m_\xi)^* P^{-1} (\xi - m_\xi) \right\} \tag{5.1-3} $$

Equation (5.1-3) assumes that $P$ is nonsingular. We will discuss the implications and handling of singular
cases later.

5.1.1 Joint Distribution of $Z$ and $\xi$

Several distributions which can be derived from Equation (5.1-1) will be required in order to analyze this
system. Let us first consider $p(Z \mid \xi)$, the conditional density of $Z$ given $\xi$. This distribution is defined
whether $\xi$ is random or not. If $\xi$ is given, then Equation (5.1-1) is simply the sum of a constant vector
and a constant matrix times a Gaussian vector. Using the properties of Gaussian distributions discussed in
Chapter 3, we see that the conditional distribution of $Z$ given $\xi$ is Gaussian with mean and covariance

$$ E\{Z \mid \xi\} = C\xi + D \tag{5.1-4} $$

$$ \mathrm{cov}\{Z \mid \xi\} = GG^* \tag{5.1-5} $$

Thus, assuming that $GG^*$ is nonsingular,

$$ p(Z \mid \xi) = |2\pi GG^*|^{-1/2} \exp\left\{ -\tfrac{1}{2} (Z - C\xi - D)^* (GG^*)^{-1} (Z - C\xi - D) \right\} \tag{5.1-6} $$

If $\xi$ is random, with marginal density given by Equation (5.1-3), we can also meaningfully define the
joint distribution of $Z$ and $\xi$, the conditional distribution of $\xi$ given $Z$, and the marginal distribution
of $Z$.


For the marginal distribution of $Z$, note that Equation (5.1-1) is a linear combination of independent
Gaussian vectors. Therefore $Z$ is Gaussian with mean and covariance

$$ E\{Z\} = Cm_\xi + D \tag{5.1-7} $$

$$ \mathrm{cov}(Z) = CPC^* + GG^* \tag{5.1-8} $$

For the joint distribution of $\xi$ and $Z$, we now require the cross-correlation

$$ E\{[Z - E(Z)][\xi - E(\xi)]^*\} = CP \tag{5.1-9} $$

The joint distribution of $\xi$ and $Z$ is thus Gaussian with mean and covariance

$$ E\left\{ \begin{bmatrix} \xi \\ Z \end{bmatrix} \right\} = \begin{bmatrix} m_\xi \\ Cm_\xi + D \end{bmatrix} \tag{5.1-10} $$

$$ \mathrm{cov}\left\{ \begin{bmatrix} \xi \\ Z \end{bmatrix} \right\} = \begin{bmatrix} P & PC^* \\ CP & CPC^* + GG^* \end{bmatrix} \tag{5.1-11} $$

Note that this joint distribution could also be derived by multiplying Equations (5.1-3) and (5.1-6) according
to Bayes rule. That derivation arrives at the same results for Equations (5.1-10) and (5.1-11), but is much
more tedious.

Finally, we can derive the conditional distribution of $\xi$ given $Z$ (the posterior distribution of $\xi$) from
the joint distribution of $\xi$ and $Z$. Applying Theorem (3.5-9) to Equations (5.1-10) and (5.1-11), we see that
the conditional distribution of $\xi$ given $Z$ is Gaussian with mean and covariance

$$ E\{\xi \mid Z\} = m_\xi + PC^*(CPC^* + GG^*)^{-1}(Z - Cm_\xi - D) \tag{5.1-12} $$

$$ \mathrm{cov}(\xi \mid Z) = P - PC^*(CPC^* + GG^*)^{-1}CP \tag{5.1-13} $$

Equations (5.1-12) and (5.1-13) assume that $CPC^* + GG^*$ is nonsingular. If this matrix is singular, the
problem is ill-posed and should be restated. We will discuss the singular case later.

Assuming that $P$, $GG^*$, and $(C^*(GG^*)^{-1}C + P^{-1})$ are nonsingular, we can use the matrix inversion lemmas
(lemmas (A.1-3) and (A.1-4)) to put Equations (5.1-12) and (5.1-13) into forms that will prove intuitively
useful.

$$ E\{\xi \mid Z\} = m_\xi + (C^*(GG^*)^{-1}C + P^{-1})^{-1} C^*(GG^*)^{-1}(Z - Cm_\xi - D) \tag{5.1-14} $$

$$ \mathrm{cov}(\xi \mid Z) = (C^*(GG^*)^{-1}C + P^{-1})^{-1} \tag{5.1-15} $$

We will have much occasion to contrast the form of Equations (5.1-12) and (5.1-13) with the form of
Equations (5.1-14) and (5.1-15). We will call Equations (5.1-12) and (5.1-13) the covariance form because they
are in terms of the uninverted covariances $P$ and $GG^*$. Equations (5.1-14) and (5.1-15) are called the
information form because they are in terms of the inverses $P^{-1}$ and $(GG^*)^{-1}$, which are related to the amount of
information. (The larger the covariance, the less information you have, and vice versa.) Equation (5.1-15)
has an interpretation as addition of information: $P^{-1}$ is the amount of prior information about $\xi$, and
$C^*(GG^*)^{-1}C$ is the amount of information in the measurement; the total information after the measurement is
the sum of these two terms.
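
For a scalar system the agreement of the covariance and information forms is easy to check by direct
evaluation. The sketch below is a scalar specialization written for illustration: the argument `gg` stands
for the scalar value of GG*, and the function names and numerical values are hypothetical.

```python
from typing import Tuple


def posterior_covariance_form(z: float, c: float, d: float, gg: float,
                              m_xi: float, p: float) -> Tuple[float, float]:
    """Scalar Equations (5.1-12) and (5.1-13): uses the uninverted P and GG*."""
    measurement_variance = c * p * c + gg            # cov(Z), Equation (5.1-8)
    gain = p * c / measurement_variance
    mean = m_xi + gain * (z - c * m_xi - d)
    covariance = p - gain * c * p
    return mean, covariance


def posterior_information_form(z: float, c: float, d: float, gg: float,
                               m_xi: float, p: float) -> Tuple[float, float]:
    """Scalar Equations (5.1-14) and (5.1-15): uses the inverses of P and GG*."""
    total_information = c * (1.0 / gg) * c + 1.0 / p  # measurement plus prior information
    covariance = 1.0 / total_information
    mean = m_xi + covariance * c * (1.0 / gg) * (z - c * m_xi - d)
    return mean, covariance


args = dict(z=3.0, c=2.0, d=1.0, gg=0.25, m_xi=0.0, p=4.0)
print(posterior_covariance_form(**args))
print(posterior_information_form(**args))
```

Both calls return the same posterior mean and covariance to within rounding, and the posterior covariance
is smaller than both the prior variance and the measurement-only value 1 / (c * c / gg).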

5.1.2 A Posteriori Estimators 


Let us first examine the three types of estimators that are based on the posterior distribution $p(\xi \mid Z)$.
These three types of estimators are a posteriori expected value, maximum a posteriori probability, and
Bayesian minimum risk.

We previously derived the expression for the a posteriori expected value in the process of defining the
posterior distribution. Either the covariance or information form can be used. We will use the information
form because it ties in with other approaches as will be seen below. The a posteriori expected value
estimator is thus

$$ \hat{\xi} = m_\xi + (C^*(GG^*)^{-1}C + P^{-1})^{-1} C^*(GG^*)^{-1}(Z - Cm_\xi - D) \tag{5.1-16} $$

The maximum a posteriori probability estimate is equal to the a posteriori expected value because the
posterior distribution is Gaussian (and thus unimodal and symmetric about its mean). This fact suggests an
alternate derivation of Equation (5.1-16) which is quite enlightening. To find the maximum point of the
posterior distribution of $\xi$ given $Z$, write

$$ \ln p(\xi \mid Z) = \ln p(Z \mid \xi) + \ln p(\xi) - \ln p(Z) \tag{5.1-17} $$

Expanding this equation using Equations (5.1-3) and (5.1-6) gives

$$ \ln p(\xi \mid Z) = -\tfrac{1}{2} (Z - C\xi - D)^* (GG^*)^{-1} (Z - C\xi - D)
 - \tfrac{1}{2} (\xi - m_\xi)^* P^{-1} (\xi - m_\xi) + a(Z) \tag{5.1-18} $$

where $a(Z)$ is a function of $Z$ only. Equation (5.1-18) shows the problem in its "least squares" form. We
are attempting to choose $\xi$ to minimize $(\xi - m_\xi)$ and $(Z - C\xi - D)$. The matrices $P^{-1}$ and $(GG^*)^{-1}$ are
weightings used in the cost functions. The larger the value of $(GG^*)^{-1}$, the more importance is placed on
minimizing $(Z - C\xi - D)$, and vice versa.

Obtain the estimate $\hat{\xi}$ by setting the gradient of Equation (5.1-18) to zero, as suggested by
Equation (3.5-17).

$$ 0 = C^*(GG^*)^{-1}(Z - C\hat{\xi} - D) - P^{-1}(\hat{\xi} - m_\xi) \tag{5.1-19} $$

Write this as

$$ 0 = C^*(GG^*)^{-1}(Z - Cm_\xi - D) - P^{-1}(\hat{\xi} - m_\xi) - C^*(GG^*)^{-1}C(\hat{\xi} - m_\xi) \tag{5.1-20} $$

and the solution is

$$ \hat{\xi} = m_\xi + (C^*(GG^*)^{-1}C + P^{-1})^{-1} C^*(GG^*)^{-1}(Z - Cm_\xi - D) \tag{5.1-21} $$

assuming that the inverses exist. For Gaussian distributions, Equation (3.5-18) gives the covariance as

$$ \mathrm{cov}(\xi \mid Z) = -[\nabla_\xi^2 \ln p(\xi \mid Z)]^{-1} = (C^*(GG^*)^{-1}C + P^{-1})^{-1} \tag{5.1-22} $$

Note the second gradient is negative definite (and the covariance positive definite), verifying that the
solution is a maximum of the posterior probability density function. This derivation does not require the use
of matrix inversion lemmas, or the expression from Chapter 3 for the Gaussian conditional distribution. For
more complicated problems, such as conditional distributions of N jointly Gaussian vectors, the alternate
derivation as in Equations (5.1-17) to (5.1-22) is much easier than the straightforward derivation as in
Equations (5.1-10) to (5.1-15).

Because of the symmetry of the posterior distribution, the Bayesian optimal estimate is also equal to
the a posteriori expected value estimate if the Bayes loss function meets the criteria of Theorem (4.3-1).

We will now examine the statistical properties of the estimator given by Equation (5.1-16). Since the
estimator is a linear function of $Z$, the bias is easy to compute.

$$ \begin{aligned}
b(\xi) &= E\{\hat{\xi} \mid \xi\} - \xi \\
 &= E\{m_\xi + (C^*(GG^*)^{-1}C + P^{-1})^{-1} C^*(GG^*)^{-1}(Z - Cm_\xi - D) \mid \xi\} - \xi \\
 &= m_\xi + (C^*(GG^*)^{-1}C + P^{-1})^{-1} C^*(GG^*)^{-1}[E\{Z \mid \xi\} - Cm_\xi - D] - \xi \\
 &= m_\xi + (C^*(GG^*)^{-1}C + P^{-1})^{-1} C^*(GG^*)^{-1}(C\xi + D - Cm_\xi - D) - \xi \\
 &= [I - (C^*(GG^*)^{-1}C + P^{-1})^{-1} C^*(GG^*)^{-1}C](m_\xi - \xi)
\end{aligned} \tag{5.1-23} $$

The estimator is biased for all finite nonsingular $P$ and $GG^*$. The scalar case gives some insight into this
bias. If $\xi$ is scalar, the factor in brackets in Equation (5.1-23) lies between 0 and 1. As $GG^*$ decreases
and/or $P$ increases, the factor approaches 0, as does the bias. In this case, the estimator obtains less
information from the initial guess of $\xi$ (which has large covariance), and more information from the
measurement (which has small covariance). If the situation is reversed, $GG^*$ increasing and/or $P$ decreasing, the
bias becomes larger. In this case, the estimator shows an increasing predilection to ignore the measured
response and to keep the initial guess of $\xi$.
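
The scalar behavior just described can be tabulated directly. In the sketch below, `gg` denotes the scalar
value of GG*; the function name and the particular values are hypothetical and chosen only to show the
trend.

```python
def bias_factor(c: float, gg: float, p: float) -> float:
    """Scalar form of the bracketed factor in Equation (5.1-23)."""
    measurement_information = c * c / gg   # scalar C*(GG*)^-1 C
    prior_information = 1.0 / p            # scalar P^-1
    return 1.0 - measurement_information / (measurement_information + prior_information)


# Smaller measurement noise or a larger prior covariance drives the factor toward 0;
# larger measurement noise or a smaller prior covariance drives it toward 1.
for gg, p in [(1.0, 1.0), (0.01, 1.0), (1.0, 100.0), (100.0, 1.0), (1.0, 0.01)]:
    print(f"GG* = {gg:7.2f}, P = {p:7.2f}, factor = {bias_factor(1.0, gg, p):.4f}")
```

The factor equals the fraction of the total information that comes from the prior, which is why it always
lies between 0 and 1 in the scalar case.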

The variance and mean square error are also easy to compute. The variance of $\hat{\xi}$ follows directly from
Equations (5.1-16) and (5.1-5):

$$ \begin{aligned}
\mathrm{cov}(\hat{\xi} \mid \xi) &= (C^*(GG^*)^{-1}C + P^{-1})^{-1} C^*(GG^*)^{-1}\, GG^*\, (GG^*)^{-1}C\, (C^*(GG^*)^{-1}C + P^{-1})^{-1} \\
 &= (C^*(GG^*)^{-1}C + P^{-1})^{-1} C^*(GG^*)^{-1}C\, (C^*(GG^*)^{-1}C + P^{-1})^{-1}
\end{aligned} \tag{5.1-24} $$

The mean square error is then

$$ \mathrm{mse}(\hat{\xi}) = \mathrm{cov}(\hat{\xi} \mid \xi) + b(\xi)\, b(\xi)^* \tag{5.1-25} $$

which is evaluated using Equations (5.1-23) and (5.1-24).

The most obvious question to ask in relation to Equations (5.1-24) and (5.1-25) is how they compare with
other estimators and with the Cramer-Rao bound. Let us evaluate the Cramer-Rao bound. The Fisher information
matrix (Equation (4.2-19)) is easy to compute using Equation (5.1-0:

$$
\begin{aligned}
M &= E\{C^*(GG^*)^{-1}(Z - C\xi - D)(Z - C\xi - D)^*(GG^*)^{-1}C\} \\
&= C^*(GG^*)^{-1}\, GG^* \,(GG^*)^{-1}C = C^*(GG^*)^{-1}C
\end{aligned}
\tag{5.1-26}
$$





Thus the Cramer-Rao bound for unbiased estimators is

$$
\operatorname{mse}(\hat{\xi} \mid \xi) \ge (C^*(GG^*)^{-1}C)^{-1}
\tag{5.1-27}
$$

Note that, for some values of $\xi$, the a posteriori expected value estimator has a lower mean-square error than
the Cramer-Rao bound for unbiased estimators; naturally, this is because the estimator is biased. To compute
the Cramer-Rao bound for an estimator with bias given by Equation (5.1-23), we need to evaluate

$$
\begin{aligned}
I + \nabla_\xi b(\xi) &= I + (C^*(GG^*)^{-1}C + P^{-1})^{-1} C^*(GG^*)^{-1}C - I \\
&= (C^*(GG^*)^{-1}C + P^{-1})^{-1} C^*(GG^*)^{-1}C
\end{aligned}
\tag{5.1-28}
$$

The Cramer-Rao bound is then (from Equation (4.2-10))

$$
\operatorname{mse}(\hat{\xi} \mid \xi) \ge (C^*(GG^*)^{-1}C + P^{-1})^{-1} C^*(GG^*)^{-1}C\, (C^*(GG^*)^{-1}C + P^{-1})^{-1}
\tag{5.1-29}
$$

Note that the estimator does not achieve the Cramer-Rao bound except at the single point $\xi = m_\xi$. At every
other point, the second term in Equation (5.1-25) is positive, and the first term is equal to the bound;
therefore, the mse is greater than the bound.
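A small numerical comparison may make these statements concrete. The sketch below assumes a scalar problem with $C = GG^* = P = 1$ and $m_\xi = 0$; these values and the helper name `map_mse_terms` are illustrative assumptions only.

```python
def map_mse_terms(xi, m_xi, c, gg, p):
    """Return (variance, squared bias) of the scalar a posteriori estimator."""
    m = c * c / gg                       # Fisher information, Equation (5.1-26)
    a = m + 1.0 / p                      # C*(GG*)^-1 C + P^-1
    variance = m / (a * a)               # Equation (5.1-24), equal to the bound (5.1-29)
    bias = (1.0 - m / a) * (m_xi - xi)   # Equation (5.1-23)
    return variance, bias * bias

c, gg, p, m_xi = 1.0, 1.0, 1.0, 0.0
unbiased_bound = gg / (c * c)            # Equation (5.1-27)

results = {}
for xi in (0.0, 1.0, 3.0):
    variance, bias_sq = map_mse_terms(xi, m_xi, c, gg, p)
    results[xi] = (variance + bias_sq, variance)   # (mse, bound for this bias)
    print(xi, results[xi], unbiased_bound)
```

With these values the mse equals the bound of Equation (5.1-29) only at $\xi = m_\xi = 0$, lies below the unbiased bound of Equation (5.1-27) for true values near the prior mean, and exceeds it once the prior mean is far from the true value.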


For a single observation, we can say in summary that the a posteriori estimator is optimal Bayesian for
a large class of loss functions, but it is biased and does not achieve the Cramer-Rao lower bound. It remains
to investigate the asymptotic properties. The asymptotic behavior of estimators for static systems is defined
in terms of N independent repetitions of the experiment, where N approaches infinity. We must first define
the application of the a posteriori estimator to repeated experiments.

Assume that the system model is given by Equation (5.1-1), with $\xi$ distributed according to
Equation (5.1-3). Perform $N$ experiments $U_1, \ldots, U_N$. (It does not matter whether the $U_i$ are distinct.)
The corresponding system matrices are $C_i$, $D_i$, and $G_iG_i^*$, and the measurements are $Z_i$. The random noise
$\omega_i$ is an independent, zero-mean, identity covariance, Gaussian vector for each $i$. The maximum a posteriori
estimate of $\xi$ is given by

$$
\hat{\xi} = m_\xi + \left[\sum_{i=1}^{N} C_i^*(G_iG_i^*)^{-1}C_i + P^{-1}\right]^{-1} \sum_{i=1}^{N} C_i^*(G_iG_i^*)^{-1}(Z_i - C_i m_\xi - D_i)
\tag{5.1-30}
$$

assuming that the inverses exist.

The asymptotic properties are defined for repetition of the same experiment, so we do not need the full
generality of Equation (5.1-30). If $U_i = U_j$, $C_i = C_j$, $D_i = D_j$, and $G_i = G_j$ for all $i$ and $j$,
Equation (5.1-30) can be written

$$
\hat{\xi} = m_\xi + [N C^*(GG^*)^{-1}C + P^{-1}]^{-1} C^*(GG^*)^{-1} \sum_{i=1}^{N} (Z_i - C m_\xi - D)
\tag{5.1-31}
$$

Compute the bias, covariance, and mse of this estimate in the same manner as Equations (5.1-23)
to (5.1-25):

$$
b(\xi) = [I - (N C^*(GG^*)^{-1}C + P^{-1})^{-1} N C^*(GG^*)^{-1}C](m_\xi - \xi)
\tag{5.1-32}
$$

$$
\operatorname{cov}(\hat{\xi} \mid \xi) = [N C^*(GG^*)^{-1}C + P^{-1}]^{-1} N C^*(GG^*)^{-1}C\, [N C^*(GG^*)^{-1}C + P^{-1}]^{-1}
\tag{5.1-33}
$$

$$
\operatorname{mse}(\hat{\xi} \mid \xi) = \operatorname{cov}(\hat{\xi} \mid \xi) + b(\xi)\,b(\xi)^*
\tag{5.1-34}
$$

The Cramer-Rao bound for unbiased estimators is

$$
\operatorname{mse}(\hat{\xi} \mid \xi) \ge (N C^*(GG^*)^{-1}C)^{-1}
\tag{5.1-35}
$$


As $N$ increases, Equation (5.1-32) goes to zero, so the estimator is asymptotically unbiased. The effect of
increasing $N$ is exactly comparable to increasing $(GG^*)^{-1}$; as we take more and better quality measurements,
the estimator depends more heavily on the measurements and less on its initial guess.

The estimator is also asymptotically efficient as defined by Equation (4.2-28) because

$$
N C^*(GG^*)^{-1}C \,\operatorname{cov}(\hat{\xi} \mid \xi) \xrightarrow[N \to \infty]{} I
\tag{5.1-36}
$$

$$
N C^*(GG^*)^{-1}C \, b(\xi)\,b(\xi)^* \xrightarrow[N \to \infty]{} 0
\tag{5.1-37}
$$
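The two limits can be observed numerically in a scalar case. This is only a sketch under assumed values ($C = GG^* = P = 1$, $m_\xi = 0$, true value 2); the function name `repeated_map_stats` is hypothetical.

```python
def repeated_map_stats(n, xi, m_xi, c, gg, p):
    """Scalar bias (5.1-32) and covariance (5.1-33) after n repeated experiments."""
    m = c * c / gg                 # information from one experiment
    a = n * m + 1.0 / p            # N C*(GG*)^-1 C + P^-1
    bias = (1.0 - n * m / a) * (m_xi - xi)
    cov = n * m / (a * a)
    return bias, cov

c, gg, p, m_xi, xi = 1.0, 1.0, 1.0, 0.0, 2.0
table = []
for n in (1, 10, 100, 1000):
    bias, cov = repeated_map_stats(n, xi, m_xi, c, gg, p)
    info = n * c * c / gg
    # Left-hand sides of Equations (5.1-36) and (5.1-37)
    table.append((n, bias, info * cov, info * bias * bias))
    print(table[-1])
```

The third column approaches 1 and the second and fourth columns approach 0 as $N$ grows, which is the scalar form of asymptotic unbiasedness and efficiency.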


5.1.3 Maximum Likelihood Estimator 

The derivation of the expression for the maximum likelihood estimator is similar to the derivation of the
maximum a posteriori probability estimator done in Equations (5.1-17) to (5.1-22). The only difference is
that instead of $\ln p(\xi \mid Z)$, we maximize

$$
\ln p(Z \mid \xi) = -\tfrac{1}{2}(Z - C\xi - D)^*(GG^*)^{-1}(Z - C\xi - D) + a(Z)
\tag{5.1-38}
$$





The only relevant difference between Equation (5.1-38) and Equation (5.1-18) is the inclusion of the term based
on the prior distribution of $\xi$ in Equation (5.1-18). (The $a(Z)$ are also different, but this is of no
consequence at the moment.) The maximum likelihood estimate does not make use of the prior distribution; indeed
it does not require that such a distribution exist. We will see that many of the MLE results are equal to the
MAP results with the terms from the prior distribution omitted.

Find the maximum point of Equation (5.1-38) by setting the gradient to zero.

$$
0 = C^*(GG^*)^{-1}(Z - C\hat{\xi} - D)
\tag{5.1-39}
$$

The solution, assuming that $C^*(GG^*)^{-1}C$ is nonsingular, is given by

$$
\hat{\xi} = (C^*(GG^*)^{-1}C)^{-1} C^*(GG^*)^{-1}(Z - D)
\tag{5.1-40}
$$

This is the same form as that of the MAP estimate, Equation (5.1-21), with $P^{-1}$ set to zero.

A particularly simple case occurs when $C = I$ and $D = 0$. In this event, Equation (5.1-40) reduces to
$\hat{\xi} = Z$.

Note that the expression $(C^*(GG^*)^{-1}C)^{-1}C^*(GG^*)^{-1}$ is a left-inverse of $C$; that is

$$
[(C^*(GG^*)^{-1}C)^{-1} C^*(GG^*)^{-1}]\,C = I
\tag{5.1-41}
$$

We can view the estimator given by Equation (5.1-40) as a pseudo-inverse of the system given by
Equation (5.1-1). Using both equations, write

$$
\begin{aligned}
\hat{\xi} &= (C^*(GG^*)^{-1}C)^{-1} C^*(GG^*)^{-1}(C\xi + D + G\omega - D) \\
&= \xi + (C^*(GG^*)^{-1}C)^{-1} C^*(GG^*)^{-1} G\omega \\
&= \xi + (C^*(GG^*)^{-1}C)^{-1} C^*(G^*)^{-1}\omega
\end{aligned}
\tag{5.1-42}
$$

Although we must use Equation (5.1-40) to compute $\hat{\xi}$ because $\xi$ and $\omega$ are not known,
Equation (5.1-42) is useful in analyzing and understanding the behavior of the estimator. One interesting point
is immediately obvious from Equation (5.1-42): the estimate is simply the sum of the true value plus the effect
of the contaminating noise $\omega$. For the particular realization $\omega = 0$, the estimate is exactly equal to
the true value. This property, which is not shared by the a posteriori estimators, is closely related to the
bias. Indeed, the bias of the maximum likelihood estimator is immediately evident from Equation (5.1-42).

$$
b(\xi) = E\{\hat{\xi} \mid \xi\} - \xi = 0
\tag{5.1-43}
$$

The maximum likelihood estimate is thus unbiased. Note that Equation (5.1-32) for the MAP bias gives the same
result if we substitute 0 for $P^{-1}$.


Since the estimator is unbiased, the covariance and mean square error are equal. Using Equation (5.1-42),
they are given by

$$
\begin{aligned}
\operatorname{cov}(\hat{\xi} \mid \xi) = \operatorname{mse}(\hat{\xi} \mid \xi) &= (C^*(GG^*)^{-1}C)^{-1} C^*(G^*)^{-1} G^{-1} C\, (C^*(GG^*)^{-1}C)^{-1} \\
&= (C^*(GG^*)^{-1}C)^{-1}
\end{aligned}
\tag{5.1-44}
$$

We can also obtain this result from Equations (5.1-33) and (5.1-34) for the MAP estimator by substituting 0
for $P^{-1}$.


We previously computed the Cramer-Rao bound for unbiased estimators for this problem (Equation (5.1-27)).
The mean square error of the maximum likelihood estimator is exactly equal to the Cramer-Rao bound. The
maximum likelihood estimator is thus efficient and is, therefore, a minimum variance unbiased estimator. The
maximum likelihood estimator is not, in general, Bayesian optimal. Bayesian optimality may not even be
defined, since $\xi$ need not be random.

The MLE results for repeated experiments can be obtained from the corresponding MAP equations by
substituting zero for $P^{-1}$ and $m_\xi$. We will not repeat these equations here.
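The properties of the maximum likelihood estimator derived in this section can be checked on a small numerical case. The sketch below assumes real matrices (so that the adjoint is the transpose) and uses NumPy; the matrices `C`, `D`, `G` and the helper `mle_gain` are made-up illustrations, not values from the text.

```python
import numpy as np

def mle_gain(C, G):
    """Return K = (C*(GG*)^-1 C)^-1 C*(GG*)^-1 and the Fisher information M."""
    R_inv = np.linalg.inv(G @ G.T)
    M = C.T @ R_inv @ C                      # Equation (5.1-26)
    K = np.linalg.solve(M, C.T @ R_inv)      # gain used in Equation (5.1-40)
    return K, M

# Hypothetical system with 3 measurements and 2 parameters
C = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
D = np.array([0.5, -1.0, 0.0])
G = np.diag([1.0, 2.0, 0.5])
xi_true = np.array([2.0, -3.0])

K, M = mle_gain(C, G)

# Equation (5.1-41): K is a left-inverse of C.
left_inverse_ok = np.allclose(K @ C, np.eye(2))

# Equation (5.1-42) with omega = 0: the estimate equals the true value.
Z_noise_free = C @ xi_true + D
xi_hat = K @ (Z_noise_free - D)

# Equation (5.1-44): cov = K (GG*) K* equals the Cramer-Rao bound M^-1.
cov = K @ (G @ G.T) @ K.T
cov_matches_bound = np.allclose(cov, np.linalg.inv(M))

print(left_inverse_ok, xi_hat, cov_matches_bound)
```

Using `np.linalg.solve` rather than forming the inverse of `M` explicitly is a numerical preference of this sketch, not something the derivation requires.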

5.1.4 Comparison of Estimators 

We have seen that the maximum likelihood estimator is unbiased and efficient, whereas the a posteriori 
estimators are only asymptotically unbiased and efficient. On the other hand, the a posteriori estimators are 
Bayesian optimal for a large class of loss functions. Thus neither estimator emerges as an unchallenged 
favorite. The reader might reasonably expect some guidance as to which estimator to choose for a given 
problem. 

The roles of the two estimators are actually quite distinct and well-defined. The maximum likelihood
estimator does the best possible job (in the sense of minimum mean square error) of estimating the value of $\xi$
based on the measurements alone, without prejudice (bias) from any preconceived guess about the value. The
maximum likelihood estimator is thus the obvious choice when we have no prior information. Having no prior
information is analogous to having a prior distribution with infinite variance; i.e., $P^{-1} = 0$. In this regard,
examine Equation (5.1-16) for the a posteriori estimate as $P^{-1}$ goes to zero. The limit is (assuming that
$C^*(GG^*)^{-1}C$ is nonsingular)

$$
\begin{aligned}
\hat{\xi} &= m_\xi + (C^*(GG^*)^{-1}C)^{-1} C^*(GG^*)^{-1}(Z - C m_\xi - D) \\
&= m_\xi - (C^*(GG^*)^{-1}C)^{-1} C^*(GG^*)^{-1}C m_\xi + (C^*(GG^*)^{-1}C)^{-1} C^*(GG^*)^{-1}(Z - D) \\
&= (C^*(GG^*)^{-1}C)^{-1} C^*(GG^*)^{-1}(Z - D)
\end{aligned}
\tag{5.1-45}
$$

which is equal to the maximum likelihood estimate. The maximum likelihood estimate is thus a limiting case
of an a posteriori estimator as the variance of the prior distribution approaches infinity.
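The limiting behavior can be illustrated by letting the prior covariance grow. The following sketch assumes real matrices, a prior covariance of the form $P = sI$, and made-up data; `map_estimate` is a hypothetical helper that implements the information form of the a posteriori estimate used in Equation (5.1-16).

```python
import numpy as np

def map_estimate(Z, C, D, G, m_xi, P_inv):
    """A posteriori estimate in the information form of Equation (5.1-16)."""
    R_inv = np.linalg.inv(G @ G.T)
    A = C.T @ R_inv @ C + P_inv
    return m_xi + np.linalg.solve(A, C.T @ R_inv @ (Z - C @ m_xi - D))

# Hypothetical system and data
C = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
D = np.array([0.5, -1.0, 0.0])
G = np.diag([1.0, 2.0, 0.5])
Z = np.array([1.0, 2.0, 3.0])
m_xi = np.array([10.0, -10.0])           # deliberately poor prior mean

# Maximum likelihood estimate: the same formula with P^-1 = 0.
xi_ml = map_estimate(Z, C, D, G, np.zeros(2), np.zeros((2, 2)))

gaps = []
for s in (1.0, 1e2, 1e4, 1e8):           # prior covariance P = s * I
    xi_map = map_estimate(Z, C, D, G, m_xi, np.eye(2) / s)
    gaps.append(np.linalg.norm(xi_map - xi_ml))
    print(s, gaps[-1])
```

The gap between the two estimates shrinks steadily as `s` increases, even though the prior mean is far from the maximum likelihood estimate.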

The a posteriori estimate combines the information from the measurements with the prior information to
obtain the optimal estimate considering both sources. This estimator makes use of more information and thus
can obtain more accurate estimates, on the average. With this improved average accuracy comes a bias in favor
of the prior estimate. If the prior estimate is good, the a posteriori estimate will generally be more
accurate than the maximum likelihood estimate. If the prior estimate is poor, the a posteriori estimate will be
poor. The advantages of the a posteriori estimators thus depend heavily on the accuracy of the prior estimate
of the value.

The basic criterion in deciding whether to use an MAP or MLE estimator is whether you want estimates based
only on the current data or based on both the current data and the prior information. The MLE estimate is
based only on the current data, and the MAP estimate is based on both the current data and the prior
distribution.

The distinction between the MLE and MAP estimators often becomes blurred in practical application. The
estimators are closely related in numerical computation, as well as in theory. An MAP estimate can be an
intermediate computational step to obtaining a final MLE estimate, or vice versa. The following paragraphs
describe one of these situations; the other situation is discussed in Section 5.2.2.

It is quite common to have a prior guess of the parameters, but to desire an independent verification of
the value based on the measurements alone. In this case, the maximum likelihood estimator is the appropriate
tool in order to make the estimates independent of the initial guess.

A two-step estimation is often the most appropriate to obtain maximum insight into a problem. First, use 
the maximum likelihood estimator to obtain the best estimates based on the measurements alone, ignoring any 
prior information. Then consider the prior information in order to obtain a final best estimate based on both 
the measurements and the prior information. By this two-step approach, we can see where the information is
coming from: the prior distribution, the measurements, or both sources. The two-step approach also allows the
freedom to independently choose the methodology for each step. For instance, we might desire to use a maximum 
likelihood estimator for obtaining the estimates based on the measurements, but use engineering judgment to 
establish the best compromise between the prior expectations and the maximum likelihood results. This is often 
the best approach because it may be difficult to completely and accurately characterize the prior information 
in terms of a specific probability distribution. The prior information often includes heuristic factors such 
as the engineer's judgment of what would constitute reasonable results. 

The theory of sufficient statistics (Ferguson, 1967; Cramer, 1946; and Fisher, 1921) is useful in the 
two-step approach if we desire to use statistical techniques for both steps. The maximum likelihood estimate 
and its covariance form a sufficient statistic for this problem. Although we will not go into detail here, 
if we know the maximum likelihood estimate and its covariance, we know all of the statistically useful
information that can be extracted from the data. The specific application is that the a posteriori estimates can
be written in terms of the maximum likelihood estimate and its covariance instead of as a direct function of the
data. The following expression is easy to verify using Equations (5.1-16), (5.1-40), and (5.1-44):

$$
\hat{\xi}_a = m_\xi + (Q^{-1} + P^{-1})^{-1} Q^{-1}(\hat{\xi}_{ml} - m_\xi)
\tag{5.1-46}
$$

where $\hat{\xi}_a$ is the a posteriori estimate (Equation (5.1-16)), $\hat{\xi}_{ml}$ is the maximum likelihood
estimate (Equation (5.1-40)), and $Q$ is the covariance of the maximum likelihood estimate (Equation (5.1-44)).
In this form, the relationship between the a posteriori estimate and the maximum likelihood estimate is plain.
The prior distribution is the only factor which enters into the relationship; it has nothing directly to do with
the measured data or even with what experiment was performed.
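Equation (5.1-46) can be verified numerically. The sketch below assumes real matrices and made-up values for the system, the data, and the prior; it computes the a posteriori estimate once directly from the data and once from the maximum likelihood estimate and its covariance.

```python
import numpy as np

# Hypothetical system, data, and prior
C = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
D = np.array([0.5, -1.0, 0.0])
G = np.diag([1.0, 2.0, 0.5])
Z = np.array([1.0, 2.0, 3.0])
m_xi = np.array([0.5, 1.0])
P = np.array([[2.0, 0.5], [0.5, 1.0]])

R_inv = np.linalg.inv(G @ G.T)
M = C.T @ R_inv @ C
P_inv = np.linalg.inv(P)

# Maximum likelihood estimate (5.1-40) and its covariance Q (5.1-44)
xi_ml = np.linalg.solve(M, C.T @ R_inv @ (Z - D))
Q = np.linalg.inv(M)
Q_inv = np.linalg.inv(Q)

# A posteriori estimate computed directly from the data, Equation (5.1-16)
xi_a_direct = m_xi + np.linalg.solve(M + P_inv, C.T @ R_inv @ (Z - C @ m_xi - D))

# The same estimate computed only from xi_ml, Q, and the prior, Equation (5.1-46)
xi_a_from_ml = m_xi + np.linalg.solve(Q_inv + P_inv, Q_inv @ (xi_ml - m_xi))

print(xi_a_direct, xi_a_from_ml)
```

The second computation never touches `Z`, `C`, `D`, or `G`, which is the practical meaning of the sufficient-statistic remark above.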

Equation (5.1-46) is closely related to the measurement-partitioning ideas of the next section. Both
relate to combining data from two different sources.


5.2 PARTITIONING IN ESTIMATION PROBLEMS 

Partitioning estimation problems has some of the same benefits as partitioning optimization problems. A 
problem half the size of the original typically takes well less than half the effort to solve. Therefore, we 
can often come out ahead by partitioning a problem into smaller subproblems. Of course, this trick only works 
if the solutions to the subproblems can easily be combined to give a solution to the original problem.

Two kinds of partitioning applicable to parameter estimation problems are measurement partitioning and 
parameter partitioning. Both of these schemes permit easy combination of the subproblem solutions in some 
situations. 

5.2.1 Measurement Partitioning 

A problem with multiple measurements can often be partitioned into a sequence of subproblems processing 
the measurements one at a time. The same principle applies to partitioning a vector measurement into a series 
of scalar (or shorter vector) measurements; the only difference is notational.

The estimators under discussion are all based on $p(Z \mid \xi)$ or, for a posteriori estimators, $p(\xi \mid Z)$. We
will initially consider measurement partitioning as a problem in factoring these density functions.

Let the measurement $Z$ be partitioned into two measurements, $Z_1$ and $Z_2$. (Extensions to more than two
partitions follow the same principles.) We would like to factor $p(Z \mid \xi)$ into separate factors dependent on
$Z_1$ and $Z_2$. By Bayes' rule, we can always write

$$
p(Z \mid \xi) = p(Z_2 \mid Z_1, \xi)\, p(Z_1 \mid \xi)
\tag{5.2-1}
$$

This form does not directly achieve the required separation because $p(Z_2 \mid Z_1, \xi)$ involves both $Z_1$ and
$Z_2$. To achieve the required separation, we introduce the requirement that

$$
p(Z_2 \mid Z_1, \xi) = p(Z_2 \mid \xi)
\tag{5.2-2}
$$


We will call this the Markov criterion. 


Heuristically, the Markov criterion assures that $p(Z_1 \mid \xi)$ contains all of the useful information we can
extract from $Z_1$. Therefore, having computed $p(Z_1 \mid \xi)$ at the measured value of $Z_1$, we have no further
need for $Z_1$. If the Markov criterion does not hold, then there are interactions that require $Z_1$ and $Z_2$ to be
considered together instead of separately. For systems with additive noise, the Markov criterion implies that
the noise in $Z_2$ is independent of that in $Z_1$. Note that this does not mean that $Z_2$ is independent of $Z_1$.

For systems where the Markov criterion holds, we can substitute Equation (5.2-2) into Equation (5.2-1)
to get

$$
p(Z \mid \xi) = p(Z_2 \mid \xi)\, p(Z_1 \mid \xi)
\tag{5.2-3}
$$

which is the desired factorization of $p(Z \mid \xi)$.

When $\xi$ has a prior distribution, the factorization of $p(\xi \mid Z)$ follows from that of $p(Z \mid \xi)$:

$$
p(\xi \mid Z) = \frac{p(Z \mid \xi)\, p(\xi)}{p(Z)} = \frac{p(Z_2 \mid \xi)\, p(Z_1 \mid \xi)\, p(\xi)}{p(Z)}
\tag{5.2-4}
$$

The mixing of $Z_1$ and $Z_2$ in the $p(Z)$ in the denominator is not important, because the denominator is merely
a normalizing constant, independent of $\xi$. It will prove convenient to write Equation (5.2-4) in the form

$$
p(\xi \mid Z) = \frac{p(Z_2 \mid \xi)\, p(\xi \mid Z_1)\, p(Z_1)}{p(Z)}
\tag{5.2-5}
$$


Let us now consider measurement partitioning of an MAP estimator for a system with $p(\xi \mid Z)$ factored as in
Equation (5.2-5). The MAP estimate is

$$
\hat{\xi} = \arg\max_{\xi}\; p(Z_2 \mid \xi)\, p(\xi \mid Z_1)
\tag{5.2-6}
$$

This equation is identical in form to Equation (4.3-18), with $p(\xi \mid Z_1)$ playing the role of the prior
distribution. We have, therefore, the following two-step process for obtaining the MAP estimate by measurement
partitioning:

First, evaluate the posterior distribution of $\xi$ given $Z_1$. This is a function of $\xi$, rather than a single
value. Practical application demands that this distribution be easily representable by a few statistics, but
we put off such considerations until the next section. Then use this as the prior distribution for an MAP
estimator with the measurement $Z_2$. Provided that the system meets the Markov criterion, the resulting
estimate should be identical to that obtained by the unpartitioned MAP estimator.

Measurement partitioning of the MLE estimator follows similar lines, except for some issues of interpretation.
The MLE estimate for a system factored as in Equation (5.2-3) is

$$
\hat{\xi} = \arg\max_{\xi}\; p(Z_2 \mid \xi)\, p(Z_1 \mid \xi)
\tag{5.2-7}
$$

This equation is identical in form to Equation (4.3-18), with $p(Z_1 \mid \xi)$ playing the role of the prior
distribution. The two steps of the partitioned MLE estimator are therefore as follows: first, evaluate
$p(Z_1 \mid \xi)$ at the measured value of $Z_1$, giving a function of $\xi$. Then use this function as the prior
density for an MAP estimator with measurement $Z_2$. Provided that the system meets the Markov criterion, the
resulting estimate should be identical to that obtained by the unpartitioned MLE estimator.

The partitioned MLE estimator raises an issue of interpretation of $p(Z_1 \mid \xi)$. It is not a probability
density function of $\xi$. The vector $\xi$ need not even be random. We can avoid the issue of $\xi$ not being
random by using information terminology, considering $p(Z_1 \mid \xi)$ to represent the state of our knowledge of
$\xi$ based on $Z_1$ instead of being a probability density function of $\xi$. Alternately, we can simply consider
$p(Z_1 \mid \xi)$ to be a function of $\xi$ that arises at an intermediate step of computing the MLE estimate. The
process described gives the correct MLE estimate of $\xi$, regardless of how we choose to interpret the
intermediate steps.

The close connection between MAP and MLE estimators is illustrated by the appearance of an MAP estimator
as a step in obtaining the MLE estimate with partitioned measurements. The result can be interpreted either as
an MAP estimate based on the measurement $Z_2$ and the prior density $p(Z_1 \mid \xi)$, or as an MLE estimate based on
both $Z_1$ and $Z_2$.

5.2.2 Application to Linear Gaussian Systems

We now consider the application of measurement partitioning to linear systems with additive Gaussian
noise. We will first consider the partitioned MAP estimator, followed by the partitioned MLE estimator.

Let the partitioned system be

$$
Z_1 = C_1\xi + D_1 + G_1\omega_1
\tag{5.2-8a}
$$

$$
Z_2 = C_2\xi + D_2 + G_2\omega_2
\tag{5.2-8b}
$$

where $\omega_1$ and $\omega_2$ are independent Gaussian random variables with mean 0 and covariance $I$. The Markov
criterion requires that $\omega_1$ and $\omega_2$ be independent for measurement partitioning to apply. The prior
distribution of $\xi$ is Gaussian with mean $m_\xi$ and covariance $P$, and is independent of $\omega_1$ and $\omega_2$.

The first step of the partitioned MAP estimator is to compute $p(\xi \mid Z_1)$. We have previously seen that this
is a Gaussian density with mean and covariance given by Equations (5.1-12) and (5.1-13). Denote the mean and
covariance of $p(\xi \mid Z_1)$ by $m_1$ and $P_1$. Then, Equations (5.1-12) and (5.1-13) give

$$
m_1 = m_\xi + P C_1^*(C_1 P C_1^* + G_1G_1^*)^{-1}(Z_1 - C_1 m_\xi - D_1)
\tag{5.2-9}
$$

$$
P_1 = P - P C_1^*(C_1 P C_1^* + G_1G_1^*)^{-1} C_1 P
\tag{5.2-10}
$$

The second step is to compute the MAP estimate of $\xi$ using the measurement $Z_2$ and the prior density
$p(\xi \mid Z_1)$. This step is another application of Equation (5.1-12), using $m_1$ for $m_\xi$ and $P_1$ for $P$. The
result is

$$
\hat{\xi} = m_2 = m_1 + P_1 C_2^*(C_2 P_1 C_2^* + G_2G_2^*)^{-1}(Z_2 - C_2 m_1 - D_2)
\tag{5.2-11}
$$

The $\hat{\xi}$ defined by Equation (5.2-11) is the MAP estimate. It should exactly equal the MAP estimate
obtained by direct application of Equation (5.1-12) to the concatenated system. You can consider
Equations (5.2-9) through (5.2-11) to be an algebraic rearrangement of the original Equation (5.1-12); indeed, they
can be derived in such terms.
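The equivalence of the partitioned and unpartitioned MAP estimators can be sketched numerically. The code below assumes real matrices and made-up values; `map_update` is a hypothetical helper that applies the covariance form of Equations (5.1-12) and (5.1-13) once, and the noise of the two blocks is independent, as the Markov criterion requires.

```python
import numpy as np

def map_update(m, P, Z, C, D, G):
    """One covariance-form update: returns the posterior mean and covariance."""
    S = C @ P @ C.T + G @ G.T
    K = P @ C.T @ np.linalg.inv(S)
    return m + K @ (Z - C @ m - D), P - K @ C @ P

# Hypothetical prior and two measurement blocks
m_xi = np.array([0.0, 0.0])
P = np.array([[2.0, 0.5], [0.5, 1.0]])
C1, D1, G1 = np.array([[1.0, 0.0], [1.0, 1.0]]), np.array([0.5, -1.0]), np.diag([1.0, 2.0])
C2, D2, G2 = np.array([[0.0, 2.0]]), np.array([0.0]), np.array([[0.5]])
Z1, Z2 = np.array([1.0, 2.0]), np.array([3.0])

# Partitioned estimator: Equations (5.2-9) to (5.2-11)
m1, P1 = map_update(m_xi, P, Z1, C1, D1, G1)
m2, P2 = map_update(m1, P1, Z2, C2, D2, G2)

# Unpartitioned estimator on the concatenated system (block-diagonal G)
C = np.vstack([C1, C2])
D = np.concatenate([D1, D2])
Z = np.concatenate([Z1, Z2])
G = np.zeros((3, 3))
G[:2, :2] = G1
G[2:, 2:] = G2
m_batch, P_batch = map_update(m_xi, P, Z, C, D, G)

print(m2, m_batch)
```

The partitioned path inverts a 2-by-2 and a 1-by-1 matrix, whereas the concatenated path inverts a 3-by-3 matrix; both give the same mean and covariance to within roundoff.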

Example 5.2-1  Consider a system

$$
Z = \xi + \omega
$$

where $\omega$ is Gaussian with mean 0 and covariance 1, and $\xi$ has a Gaussian
prior distribution with mean 0 and covariance 1. We make two independent
measurements of $Z$ (i.e., the two samples of $\omega$ are independent) and desire
the MAP estimate of $\xi$. Suppose the $Z_1$ measurement is 2 and the $Z_2$
measurement is -1.

Without measurement partitioning, we could proceed as follows: write the
concatenated system

$$
\begin{bmatrix} Z_1 \\ Z_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\xi + \begin{bmatrix} \omega_1 \\ \omega_2 \end{bmatrix}
$$

Directly apply Equation (5.1-12) with $m_\xi = 0$, $P = 1$, $C = [1\ 1]^*$, $D = 0$,
$G = I$, and $Z = [2,\ -1]^*$. The MAP estimate is then

$$
\hat{\xi} = [1\ 1]\left(\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} + I\right)^{-1} Z = \tfrac{1}{3}(Z_1 + Z_2) = \tfrac{1}{3}
$$

Now consider this same problem with measurement partitioning. To get $p(\xi \mid Z_1)$,
apply Equations (5.2-9) and (5.2-10) with $m_\xi = 0$, $P = 1$, $C_1 = 1$, $D_1 = 0$,
$G_1 = 1$, and $Z_1 = 2$.

$$
m_1 = 1\,(2)^{-1} Z_1 = \tfrac{1}{2} Z_1 = 1
$$

$$
P_1 = 1 - 1\,(2)^{-1}\,1 = \tfrac{1}{2}
$$

For the second step, apply Equation (5.2-11) with $m_1 = 1$, $P_1 = 1/2$, $C_2 = 1$,
$D_2 = 0$, $G_2 = 1$, and $Z_2 = -1$.

$$
\hat{\xi} = 1 + \tfrac{1}{2}\left(\tfrac{3}{2}\right)^{-1}(Z_2 - 1) = \tfrac{1}{3} Z_2 + \tfrac{2}{3} = \tfrac{1}{3}
$$

We see that the results of the two approaches are identical in this example,
as claimed. Note that the partitioning removes the requirement to invert a
2-by-2 matrix, substituting two 1-by-1 inversions.

The computational advantages of using the partitioned form of the MAP estimator vary depending on
numerous factors. There are numerous other rearrangements of Equations (5.1-12) and (5.1-13). The information
form of Equations (5.1-14) and (5.1-15) is often preferable if the required inverses exist. The information
form can also be used in the partitioned estimator, replacing Equations (5.2-9) through (5.2-11) with
corresponding information forms. Equation (5.1-30) is another alternative, which is often the most efficient.

There is at least one circumstance in which a partitioned form is mandatory. This is when the data
comes in two separate batches and the first batch of data must be discarded (for any of several reasons,
perhaps unavailability of enough computer memory) before processing the second batch. Such circumstances occur
regularly. Partitioned estimators are also particularly appropriate when you have already computed the
estimate based on the first batch of data before receiving the second batch.

Let us now consider the partitioned MLE estimator. The first step is to compute $p(Z_1 \mid \xi)$.
Equation (5.1-38) gives a formula for $p(Z_1 \mid \xi)$. It is immediately evident that the logarithm of
$p(Z_1 \mid \xi)$ is a quadratic form in $\xi$. Therefore, although $p(Z_1 \mid \xi)$ need not be interpreted as a
probability density function of $\xi$, it has the algebraic form of a Gaussian density function, except for an
irrelevant constant multiplier. Applying Equations (3.5-17) and (3.5-18) gives the mean and covariance of this
function as

$$
m_1 = P_1 C_1^*(G_1G_1^*)^{-1}(Z_1 - D_1)
\tag{5.2-12}
$$

$$
P_1 = -[\nabla_\xi^2 \ln p(Z_1 \mid \xi)]^{-1} = [C_1^*(G_1G_1^*)^{-1}C_1]^{-1}
\tag{5.2-13}
$$

The second step of the partitioned MLE estimator is identical to the second step of the partitioned MAP
estimator. Apply Equation (5.2-11), using the $m_1$ and $P_1$ from the first step. For the partitioned MLE
estimator, it is most natural (although not required) to use the information form of Equation (5.2-11),
which is

$$
\hat{\xi} = m_1 + P_2 C_2^*(G_2G_2^*)^{-1}(Z_2 - C_2 m_1 - D_2)
\tag{5.2-14}
$$

$$
P_2 = [C_2^*(G_2G_2^*)^{-1}C_2 + P_1^{-1}]^{-1}
\tag{5.2-15}
$$

This form is more parallel to Equations (5.2-12) and (5.2-13).

Example 5.2-2  Consider a maximum likelihood estimator for the problem of
Example 5.2-1, ignoring the prior distribution of $\xi$. To get the MLE
estimate for the concatenated system, apply Equation (5.1-40) with
$C = [1\ 1]^*$, $D = 0$, $G = I$, and $Z = [2,\ -1]^*$.

$$
\hat{\xi} = (2)^{-1}[1\ 1]\,Z = \tfrac{1}{2}(Z_1 + Z_2) = \tfrac{1}{2}
$$

Now consider the same problem with measurement partitioning. For the first
step, apply Equations (5.2-12) and (5.2-13) with $C_1 = 1$, $D_1 = 0$, $G_1 = 1$,
and $Z_1 = 2$.

$$
P_1 = [1\,(1)^{-1}\,1]^{-1} = 1
$$

$$
m_1 = P_1 (1)^{-1}(Z_1 - 0) = Z_1 = 2
$$

For the second step, apply Equations (5.2-14) and (5.2-15) with $C_2 = 1$,
$D_2 = 0$, $G_2 = 1$, and $Z_2 = -1$.

$$
P_2 = [1\,(1)^{-1}\,1 + (1)^{-1}]^{-1} = \tfrac{1}{2}
$$

$$
\hat{\xi} = 2 + \tfrac{1}{2}(1)^{-1}(Z_2 - 2 - 0) = 1 + \tfrac{1}{2} Z_2 = \tfrac{1}{2}
$$

The partitioned algorithm thus gives the same result as the original
unpartitioned algorithm.
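The arithmetic of Examples 5.2-1 and 5.2-2 can be reproduced exactly with rational numbers. This is a minimal sketch for scalar quantities only; the helper names `map_step`, `mle_first_step`, and `info_step` are illustrative and do not appear in the original text.

```python
from fractions import Fraction as F

def map_step(m, p, z, c=F(1), d=F(0), gg=F(1)):
    """Covariance-form update, Equations (5.2-9) to (5.2-11), for scalars."""
    gain = p * c / (c * p * c + gg)
    return m + gain * (z - c * m - d), p - gain * c * p

def mle_first_step(z, c=F(1), d=F(0), gg=F(1)):
    """Equations (5.2-12) and (5.2-13) for scalars: returns (m1, P1)."""
    p1 = 1 / (c * c / gg)
    return p1 * c / gg * (z - d), p1

def info_step(m, p, z, c=F(1), d=F(0), gg=F(1)):
    """Information-form update, Equations (5.2-14) and (5.2-15), for scalars."""
    p2 = 1 / (c * c / gg + 1 / p)
    return m + p2 * c / gg * (z - c * m - d), p2

z1, z2 = F(2), F(-1)

# Example 5.2-1: MAP with prior mean 0 and prior variance 1
m1, p1 = map_step(F(0), F(1), z1)
xi_map, _ = map_step(m1, p1, z2)

# Example 5.2-2: MLE, no prior
n1, q1 = mle_first_step(z1)
xi_mle, q2 = info_step(n1, q1, z2)

print(m1, p1, xi_map)        # 1 1/2 1/3
print(n1, q1, q2, xi_mle)    # 2 1 1/2 1/2
```

The only difference between the two paths is the first step: the MAP path starts from the prior, whereas the MLE path starts from the first measurement alone.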

There is often confusion on the issue of the bias of the partitioned MLE estimator. This is an MLE
estimate of $\xi$ based on both $Z_1$ and $Z_2$. It is, therefore, unbiased like all MLE estimators for linear systems
with additive Gaussian noise. On the other hand, the last step of the partitioned estimator is an MAP estimate
based on $Z_2$, with a prior distribution described by $m_1$ and $P_1$. We have previously shown that MAP estimators
are biased. There is no contradiction in these two viewpoints. The estimate is biased based on the
measurement $Z_2$ alone, but unbiased based on $Z_1$ and $Z_2$.

Therefore, it is overly simplistic to universally condemn MAP estimators as biased. The bias is not 
always so clear an issue, but requires you to define exactly on what data you are basing the bias definition. 
The primary basis for deciding whether to use an MAP or MLE estimator is whether you want estimates based only 
on the current set of data, or estimates based on the current data and prior information combined. The bias 
merely reflects this decision; it does not give you independent help in deciding. 

5.2.3 Parameter Partitioning 

In parameter partitioning, we write the parameter vector $\xi$ as a function of two (or more; the
generalizations are obvious) smaller vectors $\xi_1$ and $\xi_2$:

$$
\xi = f(\xi_1, \xi_2)
\tag{5.2-16}
$$

The function $f$ must be invertible to obtain $\xi_1$ and $\xi_2$ from $\xi$, or the solution to the partitioned problem
will not be unique. The simplest kind of partitions are those in which $\xi_1$ and $\xi_2$ are partitions of the
$\xi$ vector.

With the parameter $\xi$ partitioned into $\xi_1$ and $\xi_2$, we have a partitioned optimization problem. Two
possible solution methods apply. The best method, if it can be used, is generally to solve for $\xi_1$ in terms
of $\xi_2$ (or vice versa) and substitute this relationship into the original problem. Axial iteration is another
reasonable method if solutions for $\xi_1$ and $\xi_2$ are nearly independent so that few iterations are required.
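The two solution methods can be compared on a toy problem. The sketch below assumes a hypothetical quadratic cost $J = \tfrac{1}{2}(a_{11}\xi_1^2 + 2a_{12}\xi_1\xi_2 + a_{22}\xi_2^2) - b_1\xi_1 - b_2\xi_2$ with scalar $\xi_1$ and $\xi_2$; the cost function, the coefficient values, and the function names are assumptions chosen only to show how the coupling term $a_{12}$ affects axial iteration.

```python
def solve_by_substitution(a11, a12, a22, b1, b2):
    """Eliminate xi_1 = (b1 - a12*xi_2)/a11, then solve the reduced problem for xi_2."""
    xi2 = (b2 - a12 * b1 / a11) / (a22 - a12 * a12 / a11)
    xi1 = (b1 - a12 * xi2) / a11
    return xi1, xi2

def solve_by_axial_iteration(a11, a12, a22, b1, b2, sweeps):
    """Alternately minimize over xi_1 with xi_2 fixed, then over xi_2 with xi_1 fixed."""
    xi1 = xi2 = 0.0
    for _ in range(sweeps):
        xi1 = (b1 - a12 * xi2) / a11
        xi2 = (b2 - a12 * xi1) / a22
    return xi1, xi2

def error_after(a12, sweeps):
    """Largest component error of axial iteration relative to the substitution answer."""
    exact = solve_by_substitution(2.0, a12, 1.0, 1.0, 1.0)
    approx = solve_by_axial_iteration(2.0, a12, 1.0, 1.0, 1.0, sweeps)
    return max(abs(x - y) for x, y in zip(exact, approx))

weak_error = error_after(a12=0.2, sweeps=5)     # nearly independent subproblems
strong_error = error_after(a12=1.3, sweeps=5)   # strongly coupled subproblems
print(weak_error, strong_error)
```

With weak coupling, five sweeps are ample; with strong coupling, the same five sweeps leave a large error, which is the situation in which substitution is the better choice.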


5.3 LIMITING CASES AND SINGULARITIES 

In the previous discussions, we have simply assumed that all of the required matrix inverses exist. We 
made this assumption to present some of the basic results without getting sidetracked on fine points. We will 
now take a comprehensive look at all of the singularities and limiting cases, explaining both the circumstances 
that give rise to the various special cases, and how to handle such cases when they occur. 

The reader will recognize that most of the special cases are idealizations which are seldom literally 
true. We almost never know any value perfectly (zero covariance). Conversely, it is rare to have absolutely 
no information about the value of a parameter (infinite covariance). There are very few parameters that would 
not be viewed with suspicion if an estimate of, say, $10^{56}$ were obtained. These idealizations are useful in
practice for two reasons. First, they avoid the necessity to quantify statements such as "virtually perfect" 
when the difference between virtually perfect and perfect is not of measurable consequence (although one must 
be careful: sometimes even an extremely small difference can be crucial). Second, numerical problems with 

finite arithmetic can be alleviated by recognizing essentially singular situations and treating them specially 
as though they were exactly singular. 

We will address two kinds of singularities. The first kind of singularity involves Gaussian distributions 
with singular covariance matrices. These are perfectly valid probability distributions conforming to the usual 
definition. The distributions, however, do not have density functions; therefore the maximum a posteriori 
probability and maximum likelihood estimates cannot be defined as we have done. The singularity implies that 
the probability distribution is entirely concentrated on a subspace of the originally defined probability 
space. If the problem statement is redefined to include only the subspace, the restricted problem is nonsingu- 
lar. You can also address this singularity by looking at limits as the covariance approaches the singular 
matrix, provided that the limits exist. 

The second kind of singularity involves Gaussian variables with infinite covariance. Conceptually, the 
meaning of infinite covariance is easily stated: we have no information about the value of the variable (but
we must be careful about generalizing this idea, particularly in nonlinear transformations; see the discussion
at the end of Section 4.3.4). Unluckily, infinite covariance Gaussians do not fit within the strict
definition of a probability distribution. (They cannot meet axiom 2 in Section 3.1.1.) For current purposes, we
need only recognize that an infinite covariance Gaussian distribution can be considered as a limiting case (in
some sense that we will not precisely define here) of finite covariance Gaussians. The term "generalized
probability distribution" is sometimes used in connection with such limiting arguments. The equations which
apply to the infinite covariance case are the limits of the corresponding finite covariance cases, provided
that the limits exist. The primary concern in practice is thus how to compute the appropriate limits. 

We could avoid several of the singularities by retreating to a higher level of abstraction in the mathe- 
matics. The theory can consistently treat Gaussian variables with singular covariances by replacing the con- 
cept of a probability density function with the more general concept of a Radon-Nikodym derivative. (A 
probability density function is a specific case of a Radon-Nikodym derivative.) Although such variables do 
not have probability density functions, they do have Radon-Nikodym derivatives with respect to appropriate 
measures. Substituting the more general and more abstract concept of $\sigma$-finite measures in place of
probability measures allows strict definition of infinite covariance Gaussian variables within the same context.

This level of abstraction requires considerable depth of mathematical background, but changes little in 
the practical application. We can derive the identical computational methods at a lower level of abstraction.
The abstract theory serves to place all of the theoretical results in a common framework. In many senses the
general abstract theory is simpler than the more concrete approach; there are fewer exceptions and special
cases to consider. In implementing the abstract theory, the same computational issues arise, but the
simplified viewpoint can help indicate how to resolve these issues. Simply knowing that the problem does have a
well-defined solution is a major aid to finding the solution.

The conceptual simplification gained by the abstract theory requires significantly more background than 
we assume in this book. Our emphasis will be on the computations required to deal with the singularities, 
rather than on the abstract theory. Royden (1968), Rudin (1974), and Lipster and Shiryayev (1977) treat such
subjects as $\sigma$-finite measures and Radon-Nikodym derivatives.

We will consider two general computational methods for treating singularities. The first method is to 
use alternate forms of the equations which are not affected by the singularity. The covariance form
(Equations (5.1-12) and (5.1-13)) and the information form (Equations (5.1-14) and (5.1-15)) of the posterior
distribution are equivalent, but have different points of singularity. Therefore, a singularity in one form
can often be handled simply by switching to the other form. This simple method fails if a problem statement
has singularities in both forms. Also, we may desire to stick with a particular form for other reasons.

The second method is to partition the estimation problem into two parts: the totally singular part and
the nonsingular part. This partitioning allows us to use one means of solving the singular part and another
means of solving the nonsingular part; we then combine the partial solutions to give the final result.

5.3.1 Singular P

The first case that we will consider is singular P. A singular P matrix indicates that some parameter
or linear combination of parameters is known perfectly before the experiment is performed. For instance, we
might know that $\xi_1 = 5\xi_2 + 3$, even though $\xi_1$ and $\xi_2$ are unknown. In this case, we know the linear
combination $\xi_1 - 5\xi_2$ exactly. The singular P matrix creates no problems if we use the covariance form instead of
the information form. If we specifically desire to use the information form, we can handle the singularity as
follows.

Since P is always symmetric, the range and the null space of P form an orthogonal decomposition of the
space $\Xi$. The singular eigenvectors of P (those with zero eigenvalues) span the null space, and the nonsingular
eigenvectors span the range. Use the eigenvectors to decompose the parameter estimation problem into the totally
singular subproblem and the totally nonsingular subproblem. This is a parameter partitioning as discussed in
Section 5.2. The totally singular subproblem is trivial because we know the exact solution when we start (by
definition). Substitute the solution of the singular problem in the original problem and solve the nonsingular
subproblem in the normal manner.

A specific implementation of this decomposition is as follows: let $X_S$ be the matrix of orthonormal
singular eigenvectors of P, and $X_{NS}$ be the matrix of orthonormal nonsingular eigenvectors. Then define

$$\xi_S = X_S^* \xi \tag{5.3-1a}$$

$$\xi_{NS} = X_{NS}^* \xi \tag{5.3-1b}$$

The covariances of $\xi_S$ and $\xi_{NS}$ are

$$\mathrm{cov}(\xi_S) = X_S^* P X_S = 0 \tag{5.3-2a}$$

$$\mathrm{cov}(\xi_{NS}) = X_{NS}^* P X_{NS} = P_{NS} \tag{5.3-2b}$$

where $P_{NS}$ is nonsingular. Write

$$\xi = X_{NS}\,\xi_{NS} + X_S\,\xi_S \tag{5.3-3}$$

Substitute Equation (5.3-3) into the original problem. Use the exactly known value of $\xi_S$, and restate the
problem in terms of $\xi_{NS}$ as the unknown parameter vector. Other decompositions derived from multiplying
Equation (5.3-1) by nonsingular transformations can be used if they have advantages for specific situations.

We will henceforth assume that P is nonsingular. It is unimportant whether the original problem 
statement is nonsingular or we are working with the nonsingular subproblem.

The implementation above is defined in very general terms, which would allow it to be done as an automatic
computer subroutine. In practice, we usually know the fact of and reason for the singularity beforehand
and can easily handle it more concretely. If an equation gives an exact relationship between two or more
variables which we know prior to the experiment, we solve the equation for one variable and remove that 
variable from the problem by substitution. 

Example 5.3-1 Assume that the output of a system is a known function of the
applied force and moment

$$Z = f(F,M)$$

An unknown point force is applied at a known position r referred to the
origin. We thus know that

$$M = r \times F$$

If F and M are both considered as unknowns, the P matrix is singular.
But this singularity is readily removed by substituting for M in terms of
F so that F is the only unknown.

$$Z = f(F,\, r \times F) = f_2(F)$$
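
The substitution approach can be checked numerically against the covariance form. The following Python sketch is an illustration only, under assumed values: the scalar measurement model, the numbers, and the function names `reduced_estimate` and `covariance_form_estimate` are hypothetical. It uses the exact relation $\xi_1 = 5\xi_2 + 3$ from the start of this section, removes $\xi_1$ by substitution, and compares the result with the covariance form applied directly to the singular prior.

```python
# Hypothetical scalar measurement: Z = c1*xi1 + c2*xi2 + d + g*omega,
# with the exactly known prior relation xi1 = 5*xi2 + 3.

def reduced_estimate(z, c1, c2, d, g2, m2, p2):
    """Substitute xi1 = 5*xi2 + 3 and estimate the single unknown xi2."""
    c = 5.0 * c1 + c2            # sensitivity of Z to xi2 after substitution
    d_eff = 3.0 * c1 + d         # known offset after substitution
    gain = p2 * c / (c * p2 * c + g2)
    xi2 = m2 + gain * (z - c * m2 - d_eff)
    var2 = p2 - gain * c * p2
    return 5.0 * xi2 + 3.0, xi2, var2


def covariance_form_estimate(z, c1, c2, d, g2, m2, p2):
    """Apply the covariance form to the full two-parameter problem.

    The prior covariance implied by xi1 = 5*xi2 + 3 is singular, but the
    covariance form never inverts it.
    """
    m = [5.0 * m2 + 3.0, m2]
    P = [[25.0 * p2, 5.0 * p2], [5.0 * p2, p2]]
    pc = [P[0][0] * c1 + P[0][1] * c2, P[1][0] * c1 + P[1][1] * c2]  # P C*
    s = c1 * pc[0] + c2 * pc[1] + g2                                   # CPC* + GG*
    innovation = z - c1 * m[0] - c2 * m[1] - d
    return m[0] + pc[0] * innovation / s, m[1] + pc[1] * innovation / s


args = dict(z=4.2, c1=0.5, c2=-1.0, d=0.1, g2=0.04, m2=1.0, p2=2.0)
xi1_r, xi2_r, var2 = reduced_estimate(**args)
xi1_c, xi2_c = covariance_form_estimate(**args)
print("substitution   :", xi1_r, xi2_r)
print("covariance form:", xi1_c, xi2_c)
```

Under these assumptions the two routes should agree to rounding error, which is the sense in which the singular P matrix "creates no problems" for the covariance form.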


5.3.2 Singular GG* 

The treatment of singular GG* is similar in principle to that of singular P. A singular GG* matrix
implies that some measurement or combination of measurements is made perfectly (i.e., noise-free). The
covariance form does not involve the inverse of GG*, and thus can be used with no difficulty when GG* is
singular.

An alternate approach involves a sequential decomposition of the original problem into totally singular
(GG* = 0) and nonsingular subproblems. The totally singular subproblem must be handled in the covariance form;
the nonsingular subproblem can then be handled in either form. This is a measurement partitioning as
described in Section 5.2. Divide the measurement into two portions, called the singular and the nonsingular
measurements, $Z_S$ and $Z_{NS}$. First ignore $Z_S$ and find the posterior distribution of $\xi$ given only $Z_{NS}$. Then
use this result as the distribution prior to $Z_S$. We specifically implement this decomposition as follows:

For the first step of the decomposition, let $X_{NS}$ be the matrix of nonsingular eigenvectors of GG*.
Multiply Equation (5.1-1) on the left by $X_{NS}^*$, giving

$$X_{NS}^* Z = X_{NS}^* C \xi + X_{NS}^* D + X_{NS}^* G \omega \tag{5.3-4}$$

Define

$$Z_{NS} = X_{NS}^* Z, \quad C_{NS} = X_{NS}^* C, \quad D_{NS} = X_{NS}^* D, \quad G_{NS} = X_{NS}^* G \tag{5.3-5}$$

Equation (5.3-4) then becomes

$$Z_{NS} = C_{NS}\,\xi + D_{NS} + G_{NS}\,\omega \tag{5.3-6}$$


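Note that $G_{NS} G_{NS}^*$ is nonsingular. Using the information form for the posterior distribution, the distribution
of $\xi$ conditioned on $Z_{NS}$ is

$$m_{NS} = E\{\xi \mid Z_{NS}\} = m_\xi + \left(C_{NS}^* (G_{NS} G_{NS}^*)^{-1} C_{NS} + P^{-1}\right)^{-1} C_{NS}^* (G_{NS} G_{NS}^*)^{-1} (Z_{NS} - C_{NS} m_\xi - D_{NS}) \tag{5.3-7a}$$

$$P_{NS} = \mathrm{cov}\{\xi \mid Z_{NS}\} = \left(C_{NS}^* (G_{NS} G_{NS}^*)^{-1} C_{NS} + P^{-1}\right)^{-1} \tag{5.3-7b}$$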


For the second step, let $X_S$ be the matrix of singular eigenvectors of GG*. Corresponding to
Equation (5.3-6) is

$$Z_S = C_S\,\xi + D_S + G_S\,\omega \tag{5.3-8}$$

where

$$Z_S = X_S^* Z, \quad C_S = X_S^* C, \quad D_S = X_S^* D, \quad G_S = X_S^* G = 0 \tag{5.3-9}$$


Use Equation (5.3-7) for the prior distribution for this step. Since $G_S$ is 0, we must use the covariance
form for the posterior distribution, which reduces to

$$E\{\xi \mid Z\} = m_{NS} + P_{NS} C_S^* (C_S P_{NS} C_S^*)^{-1} (Z_S - C_S m_{NS} - D_S) \tag{5.3-10a}$$

$$\mathrm{cov}\{\xi \mid Z\} = P_{NS} - P_{NS} C_S^* (C_S P_{NS} C_S^*)^{-1} C_S P_{NS} \tag{5.3-10b}$$

Equations (5.3-4), (5.3-6), (5.3-8), and (5.3-10) give an alternate expression for the posterior distribution
of $\xi$ given Z which we can use when GG* is singular. It does require that $C_S P_{NS} C_S^*$ be nonsingular.
This is a special case of the requirement that CPC* + GG* be nonsingular, which we discuss later. It is
interesting to note that the covariance (Equation (5.3-10b)) of the estimate is singular. Multiply
Equation (5.3-10b) on the right by $C_S^*$ and obtain

$$\mathrm{cov}\{\xi \mid Z\}\, C_S^* = P_{NS} C_S^* - P_{NS} C_S^* (C_S P_{NS} C_S^*)^{-1} C_S P_{NS} C_S^* = P_{NS} C_S^* - P_{NS} C_S^* = 0 \tag{5.3-11}$$

Therefore the columns of $C_S^*$ are all singular eigenvectors of the covariance of the estimate.
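
The two-step decomposition can be sketched for a small case. The Python fragment below is a minimal illustration, not a general implementation: it assumes two parameters and two scalar measurements, one noisy and one noise-free, so that GG* is already diagonal and the eigenvector matrices simply select rows. All numerical values and the helper names `inv2`, `matvec`, and `dot` are hypothetical. Step 1 uses the information form on the noisy measurement, and step 2 uses the covariance form with zero noise on the perfect measurement.

```python
def inv2(a):
    """Inverse of a 2x2 matrix stored as nested lists."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]


def matvec(a, v):
    return [a[0][0] * v[0] + a[0][1] * v[1], a[1][0] * v[0] + a[1][1] * v[1]]


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


# Hypothetical problem data.
m_xi = [0.0, 0.0]
P = [[4.0, 1.0], [1.0, 2.0]]
c_ns, d_ns, g2_ns, z_ns = [1.0, 2.0], 0.0, 0.25, 3.1   # noisy measurement
c_s, d_s, z_s = [1.0, -1.0], 0.5, 0.2                    # noise-free measurement

# Step 1: information form on the nonsingular measurement.
p_inv = inv2(P)
info = [[c_ns[i] * c_ns[j] / g2_ns + p_inv[i][j] for j in range(2)]
        for i in range(2)]
p_ns = inv2(info)
scaled = (z_ns - dot(c_ns, m_xi) - d_ns) / g2_ns
step = matvec(p_ns, [c_ns[0] * scaled, c_ns[1] * scaled])
m_ns = [m_xi[0] + step[0], m_xi[1] + step[1]]

# Step 2: covariance form on the singular measurement (noise term is zero).
pc = matvec(p_ns, c_s)                  # P_NS C_S*
s = dot(c_s, pc)                        # C_S P_NS C_S*, must be nonzero
innovation = z_s - dot(c_s, m_ns) - d_s
mean = [m_ns[0] + pc[0] * innovation / s, m_ns[1] + pc[1] * innovation / s]
cov = [[p_ns[i][j] - pc[i] * pc[j] / s for j in range(2)] for i in range(2)]

print("posterior mean:", mean)
print("cov times C_S*:", matvec(cov, c_s))
```

In this sketch the product of the posterior covariance with $C_S^*$ should vanish to rounding error, and the posterior mean should reproduce the noise-free measurement exactly.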

5.3.3 Singular CPC* + GG* 

The next special case that we will consider is when CPC* + GG* is singular. Note first that this can
happen only when GG* is also singular, because CPC* and GG* are both positive semidefinite, and the sum
of two such matrices can be singular only if both terms are singular. Since both GG* and CPC* + GG* are
singular, neither the covariance form nor the information form circumvents the singularity. In fact, there is
no way to circumvent this singularity. If CPC* + GG* is singular, the problem is intrinsically ill-posed.
The only solution is to restate the original problem.

If we examine what is implied by a singular CPC* + GG*, we will be able to see why it necessarily means
that the problem is ill-posed, and what kinds of changes in the problem statement are required. Referring to
Equation (5.1-8), we see that CPC* + GG* is the covariance of the measurement Z. GG* is the contribution
of the measurement noise to this covariance, and CPC* is the contribution due to the prior variance of $\xi$.
If CPC* + GG* is singular, we can exactly predict some part of the measured response. For this to occur,
there must be neither measurement noise nor parameter uncertainty affecting that particular part of the
response.

Clearly, there are serious mathematical difficulties in saying that we know exactly what the measured
value will be before taking the measurement. At best, the measurement can agree with what we predicted, which
adds no new information. If, however, there is any disagreement at all, even due to rounding error in the
computations, there is an irresolvable contradiction: we said that we knew exactly what the value would be and
we were wrong. This is one situation where the difference between almost perfect and perfect is extremely
important. As CPC* + GG* approaches singularity, the corresponding estimators diverge; we cannot talk about
the limiting case because the estimators do not converge to a limit in any meaningful sense.
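
One way to see the absence of a meaningful limit is a scalar experiment. The Python sketch below is an assumption-laden illustration: the scalar model, the numbers, and the function name `posterior_mean` are hypothetical. It lets the prior variance and the noise variance go to zero together while holding their ratio fixed, and shows that the estimate depends on that ratio rather than settling on a single value.

```python
def posterior_mean(z, c, d, m, p, g2):
    """Scalar covariance-form estimate of xi from Z = c*xi + d + noise."""
    return m + p * c / (c * p * c + g2) * (z - c * m - d)


# The predicted measurement c*m + d = 1.0 disagrees with the observed z.
z, c, d, m = 1.3, 2.0, 0.0, 0.5

limits = {}
for ratio in (0.1, 1.0, 10.0):            # ratio = g2 / p along each path
    limits[ratio] = [posterior_mean(z, c, d, m, eps, ratio * eps)
                     for eps in (1e-3, 1e-6, 1e-9)]

for ratio, values in limits.items():
    print("g2/p =", ratio, "->", values)
```

Along each path the estimate stays constant as both variances shrink, but different paths give different values, so CPC* + GG* = 0 has no single well-defined answer in this example.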

5.3.4 Infinite P 


Up to this point, the special cases considered have all involved singular covariance matrices, corresponding
to perfectly known quantities. The remaining special cases all concern limits as eigenvalues of a covariance
matrix approach infinity, corresponding to total ignorance of the value of a quantity.

The first such special case to discuss is when an eigenvalue of P approaches infinity. The problem is
much easier to discuss in terms of the information matrix $P^{-1}$. As an eigenvalue of P approaches infinity,
the corresponding eigenvalue of $P^{-1}$ approaches zero. At the limit, $P^{-1}$ is singular. To be cautious, we
should not speak of $P^{-1}$ being singular but only of the limit as $P^{-1}$ goes to a singularity, as it is not
meaningful to say that $P^{-1}$ is singular. Provided that we use the information form everywhere, all of the
limits as $P^{-1}$ goes to a singularity are well-behaved and can be evaluated simply by substituting the singular
value for $P^{-1}$. Thus this singularity poses no difficulties in practice, as long as we avoid the use of
expressions involving a noninverted P. As previously mentioned, the limit as $P^{-1}$ goes to zero is particularly
interesting and results in estimates identical to the maximum likelihood estimates. Using a singular
$P^{-1}$ is tantamount to saying that there is no prior information about some parameter or set of parameters (or
that we choose to discount any such information in order to obtain an independent check). There is no convenient
way to decompose the problem so that the covariance form can be used with singular $P^{-1}$ matrices.
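
The limit as $P^{-1}$ goes to zero can be illustrated with a scalar case. The Python sketch below assumes a hypothetical scalar model Z = c ξ + d + noise with made-up numbers; the function name `info_form` is not from the text. It evaluates the information form for decreasing values of $P^{-1}$, including the value zero substituted directly, and compares the result with the maximum likelihood estimate (Z − d)/c for this model.

```python
def info_form(z, c, d, g2, m, p_inv):
    """Scalar information form: returns (posterior mean, posterior variance)."""
    w = 1.0 / g2                        # (GG*)^-1 for a scalar measurement
    info = c * w * c + p_inv            # C*(GG*)^-1 C + P^-1
    return m + c * w * (z - c * m - d) / info, 1.0 / info


z, c, d, g2, m = 2.6, 2.0, 0.2, 0.5, -1.0
for p_inv in (1.0, 1e-3, 1e-6, 0.0):
    print("P^-1 =", p_inv, "->", info_form(z, c, d, g2, m, p_inv))

ml = (z - d) / c                        # maximum likelihood estimate here
print("ML estimate:", ml)
```

Substituting zero for $P^{-1}$ causes no numerical difficulty in this form, and the prior mean then drops out of the estimate.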

The meaning of a singular $P^{-1}$ is most clearly illustrated by some examples using confidence regions. A
confidence region is the area where the probability density function (really a generalized probability density
function here) is greater than or equal to some given constant. (See Chapter 11 for a more detailed discussion
of confidence regions.) Let the parameter vector consist of two elements, $\xi_1$ and $\xi_2$. Assume that the prior
distribution has mean zero and

$$P^{-1} = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$$

The prior confidence regions are given by

$$p(\xi) \ge C_1$$

or equivalently

$$\xi^* P^{-1} \xi \le C_2$$

which reduces to

$$\xi_1^2 \le C_2$$

where $C_1$ and $C_2$ are constants depending on the level of confidence desired. For current purposes, we are
interested only in the shape of the confidence region, which is independent of the values of the constants.
Figure (5.3-1) is a sketch of the shape. Note that this confidence region is a limiting case of an ellipse
with major axis length going to infinity while the minor axis is fixed. This prior distribution gives information
about $\xi_1$, but none about $\xi_2$.

Now consider a second example, which is identical to the first except that

$$P^{-1} = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}$$

In this case, the prior confidence region is

$$\xi^* P^{-1} \xi \le C_2$$

or

$$\xi_1^2 + \xi_2^2 - 2\xi_1\xi_2 \le C_2$$

or

$$(\xi_1 - \xi_2)^2 \le C_2$$

Figure (5.3-2) is a sketch of the shape of this confidence region. In this case, the difference between $\xi_1$
and $\xi_2$ is known with some confidence, but there is no information about the sum $\xi_1 + \xi_2$. The singular
eigenvectors of $P^{-1}$ correspond to directions in the parameter space about which there is no prior knowledge.

5.3.5 Infinite GG*


Corresponding to the case where $P^{-1}$ approaches a singular point is the similar case where $(GG^*)^{-1}$
approaches a singularity. As in the case of singular $P^{-1}$, there are no computational problems. We can
readily evaluate all of the limits simply by substituting the singular matrix for $(GG^*)^{-1}$. The information
form avoids the use of a noninverted GG*. A singular $(GG^*)^{-1}$ matrix would indicate that some measurement or
linear combination of measurements had infinite noise variance, which is rather unlikely. The primary use of
singular $(GG^*)^{-1}$ matrices in practice is to make the estimator ignore certain measurements if they are worthless
or simply unavailable. It is mathematically cleaner to rewrite the system model so that the unused
measurements are not included in the observation vector, but it is sometimes more convenient to simply use a
singular $(GG^*)^{-1}$ matrix. The two methods give the same result. (Not having a measurement at all is equivalent
to having one and ignoring it.) One interesting specific case occurs when $(GG^*)^{-1}$ approaches 0. This
method then amounts to ignoring all of the measurements. As might be expected, the a posteriori estimate is
then the same as the a priori estimate.
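
The equivalence between ignoring a measurement and deleting it can be checked with a small sketch. The Python fragment below assumes a scalar parameter observed through independent scalar measurements with a diagonal $(GG^*)^{-1}$; the data values and the function name `info_form` are hypothetical. A zero weight stands for an infinite noise variance.

```python
def info_form(zs, cs, weights, m, p_inv):
    """Scalar-parameter information form.

    weights holds the diagonal of (GG*)^-1; a zero weight means the
    corresponding measurement is ignored.
    """
    info = p_inv + sum(c * w * c for c, w in zip(cs, weights))
    rhs = sum(c * w * (z - c * m) for z, c, w in zip(zs, cs, weights))
    return m + rhs / info, 1.0 / info


m, p_inv = 0.5, 1.0
zs, cs = [1.9, 7.0], [1.0, 2.0]

with_zero_weight = info_form(zs, cs, [4.0, 0.0], m, p_inv)   # second ignored
measurement_dropped = info_form(zs[:1], cs[:1], [4.0], m, p_inv)
all_ignored = info_form(zs, cs, [0.0, 0.0], m, p_inv)

print(with_zero_weight, measurement_dropped)
print(all_ignored, "prior:", (m, 1.0 / p_inv))
```

With every weight set to zero the sketch returns the prior mean and variance, matching the statement that the a posteriori estimate then equals the a priori estimate.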

5.3.6 Singular $C^*(GG^*)^{-1}C + P^{-1}$

The final special case to be discussed is when the matrix $C^*(GG^*)^{-1}C + P^{-1}$ in the information form approaches
a singular value. Note that this can occur only if $P^{-1}$ is also approaching a singularity. Therefore, the
problem cannot be avoided by using the covariance form. If $C^*(GG^*)^{-1}C + P^{-1}$ is singular, it means that there
is no prior information about a parameter or combination of parameters, and that the experiment added no such
information. The difficulty, then, is that there is absolutely no basis for estimating the value of the singular
parameter or combination. The system is referred to as being unidentifiable when this singularity is
present. Identifiability is an important issue in the theory of parameter estimation. The easiest computational
solution is to restate the problem, deleting the parameter in question from the list of unknowns.
Essentially the same result comes from using a pseudo-inverse in Equation (5.1-14) (but see the discussion in
Section 2.4.3 on the blind use of pseudo-inverses to "solve" such problems). Of course, the best alternative
is often to examine why the experiment gave no information about the parameter, and to redesign the experiment
so that a usable estimate can be obtained.
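
A small numerical case makes the unidentifiable situation concrete. The Python sketch below is illustrative only: it assumes two parameters, scalar measurements with a common noise variance, and no prior information; the measurement rows and the helper names `info_matrix` and `det2` are hypothetical. Every original measurement responds only to the sum of the two parameters, so the information matrix is singular; adding one measurement of the difference stands in for a redesigned experiment.

```python
def info_matrix(rows, g2, p_inv):
    """C*(GG*)^-1 C + P^-1 for scalar measurements with common variance g2."""
    m = [[p_inv[i][j] for j in range(2)] for i in range(2)]
    for c in rows:
        for i in range(2):
            for j in range(2):
                m[i][j] += c[i] * c[j] / g2
    return m


def det2(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


no_prior = [[0.0, 0.0], [0.0, 0.0]]

# Every measurement row sees only xi1 + xi2.
rows = [[1.0, 1.0], [2.0, 2.0], [0.5, 0.5]]
det_unidentifiable = det2(info_matrix(rows, 0.1, no_prior))

# Redesigned experiment: one extra measurement of xi1 - xi2.
rows_redesigned = rows + [[1.0, -1.0]]
det_redesigned = det2(info_matrix(rows_redesigned, 0.1, no_prior))

print("original experiment  :", det_unidentifiable)
print("redesigned experiment:", det_redesigned)
```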


5.4 NONLINEAR SYSTEMS WITH ADDITIVE GAUSSIAN NOISE 

The general form of the system equations for a nonlinear system with additive Gaussian noise is

$$Z = f(\xi,U) + G(U)\,\omega \tag{5.4-1}$$

As in the case of linear systems, we will define by convention the mean of $\omega$ to be zero and the covariance
to be identity. If $\xi$ is random, we will assume that it is independent of $\omega$ and has the distribution given
by Equation (5.1-3).

5.4.1 Joint Distribution of Z and $\xi$

To define the estimators of Chapter 4, we need to know the distribution $p(Z \mid \xi,U)$. This distribution is
easily derived from Equation (5.4-1). The expressions $f(\xi,U)$ and G(U) are both constants if conditioned on
specific values of $\xi$ and U. Therefore we can apply the rules discussed in Chapter 3 for multiplication of
Gaussian vectors by constants and addition of constants to Gaussian vectors. Using these rules, we see that
the distribution of Z conditioned on $\xi$ and U is Gaussian with mean $f(\xi,U)$ and covariance G(U)G(U)*.

$$p(Z \mid \xi,U) = |2\pi G(U)G(U)^*|^{-1/2} \exp\left\{-\tfrac{1}{2}[Z - f(\xi,U)]^* [G(U)G(U)^*]^{-1} [Z - f(\xi,U)]\right\} \tag{5.4-2}$$

This is the obvious nonlinear generalization of Equation (5.1-6); the nonlinearity does not change the basic
method of derivation.

If $\xi$ is random, we will need to know the joint distribution $p(Z,\xi \mid U)$. The joint distribution is
computed by Bayes rule

$$p(Z,\xi \mid U) = p(Z \mid \xi,U)\, p(\xi \mid U) \tag{5.4-3}$$

Using Equations (5.1-3) and (5.4-2) gives

$$p(Z,\xi \mid U) = \left[|2\pi P|\,|2\pi GG^*|\right]^{-1/2} \exp\left\{-\tfrac{1}{2}[Z - f(\xi,U)]^* [G(U)G(U)^*]^{-1} [Z - f(\xi,U)] - \tfrac{1}{2}[\xi - m_\xi]^* P^{-1} [\xi - m_\xi]\right\} \tag{5.4-4}$$

Note that $p(Z,\xi \mid U)$ is not, in general, Gaussian. Although Z conditioned on $\xi$ is Gaussian, and $\xi$
is Gaussian, Z and $\xi$ need not be jointly Gaussian. This is one of the major differences between linear and
nonlinear systems with additive Gaussian noise.

Example 5.4-1 Let Z and $\xi$ be scalars, $P = 1$, $m_\xi = 0$, $G(U) = 1$, and
$f(\xi,U) = \xi^2$. Then

$$p(Z \mid \xi,U) = (2\pi)^{-1/2} \exp\left\{-\tfrac{1}{2}(Z - \xi^2)^2\right\}$$

and

$$p(\xi \mid U) = (2\pi)^{-1/2} \exp\left\{-\tfrac{1}{2}\xi^2\right\}$$

This gives

$$p(Z,\xi \mid U) = (2\pi)^{-1} \exp\left\{-\tfrac{1}{2}\left[\xi^2 + (Z - \xi^2)^2\right]\right\}$$

The general form of a joint Gaussian distribution for two variables Z and $\xi$ is

$$p(Z,\xi) = a \exp\{b\xi^2 + cZ^2 + dZ\xi\}$$

where a, b, c, and d are constants. The joint distribution of Z and $\xi$
cannot be manipulated into this form because a $\xi^4$ term appears in the
exponent. Thus Z and $\xi$ are not jointly Gaussian, even though Z conditioned
on $\xi$ is Gaussian and $\xi$ is Gaussian.


Given Equation (5.4-4), we can compute the marginal distribution of Z, and the conditional distribution
of $\xi$ given Z from the equations

$$p(Z) = \int p(Z,\xi)\, d\xi \tag{5.4-5}$$

$$p(\xi \mid Z) = \frac{p(Z,\xi)}{p(Z)} \tag{5.4-6}$$

The integral in Equation (5.4-5) is not easy to evaluate in general. Since $p(Z,\xi)$ is not necessarily
Gaussian, or any other standard distribution, the only general means of computing p(Z) is to numerically
integrate Equation (5.4-5) for a grid of Z values. If $\xi$ and Z are vectors, this can be a quite formidable
task. Therefore, we will avoid the use of p(Z) and $p(\xi \mid Z)$ for nonlinear systems.
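
For the scalar case of Example 5.4-1 the numerical integration is still affordable, and a sketch of it gives a feel for the cost. The Python fragment below is a rough illustration under stated assumptions: it uses the joint density of Example 5.4-1, a plain trapezoidal rule, and grid limits and spacings chosen by hand; the function names are hypothetical. Each value of p(Z) already requires a full integration over $\xi$, which is why the vector case becomes formidable.

```python
import math


def joint_density(z, xi):
    """p(Z, xi) for Example 5.4-1: f(xi) = xi**2, P = 1, GG* = 1, m_xi = 0."""
    return math.exp(-0.5 * (xi * xi + (z - xi * xi) ** 2)) / (2.0 * math.pi)


def trapezoid(values, step):
    """Trapezoidal rule on an evenly spaced grid."""
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))


def marginal_density(z, xi_max=8.0, n=801):
    """Approximate p(Z) by integrating the joint density over xi."""
    step = 2.0 * xi_max / (n - 1)
    grid = [-xi_max + i * step for i in range(n)]
    return trapezoid([joint_density(z, xi) for xi in grid], step)


z_step = 0.1
z_grid = [-8.0 + i * z_step for i in range(481)]       # Z from -8 to 40
p_z = [marginal_density(z) for z in z_grid]

total_mass = trapezoid(p_z, z_step)                    # should be close to 1
mean_z = trapezoid([z * p for z, p in zip(z_grid, p_z)], z_step)
print("integral of p(Z):", total_mass)
print("mean of Z       :", mean_z)
```

The mean of Z in this example should be close to 1, the expected value of $\xi^2$ under the prior, which gives a simple sanity check on the grid choices.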


5.4.2 Estimators 


The a posteriori expected value and Bayes optimal estimators are seldom used for nonlinear systems because
their computation is difficult. Computation of the expected value requires the numerical integration of
Equation (5.4-5) and the evaluation of Equation (5.4-6) to find the conditional distribution, and then the
integration of $\xi$ times the conditional distribution. Theorem (4.3-1) says that the Bayes optimal estimator
for quadratic loss is equal to the a posteriori expected value estimator. The computation of the Bayes optimal
estimates requires the same or equivalent multidimensional integrations, so Theorem (4.3-1) does not provide us
with a simplified means of computing the estimates.

Since the posterior distribution of $\xi$ need not be symmetric, the MAP estimate is not, in general, equal to the
a posteriori expected value for nonlinear systems. The MAP estimator does not require the use of
Equations (5.4-5) and (5.4-6). The MAP estimate is obtained by maximizing Equation (5.4-6) with respect to $\xi$.
Since p(Z) is not a function of $\xi$, we can equivalently maximize Equation (5.4-4). For general, nonlinear
systems, we must do this maximization using numerical optimization techniques.

It is usually convenient to work with the logarithm of Equation (5.4-4). Since standard optimization conventions
are phrased in terms of minimization, rather than maximization, we will state the problem as minimizing
the negative of the logarithm of the probability density.

$$-\ln p(Z,\xi \mid U) = \tfrac{1}{2}[Z - f(\xi,U)]^* (GG^*)^{-1} [Z - f(\xi,U)] + \tfrac{1}{2}[\xi - m_\xi]^* P^{-1} [\xi - m_\xi] + \tfrac{1}{2}\ln\left[|2\pi P|\,|2\pi GG^*|\right] \tag{5.4-7}$$

Since the last term of Equation (5.4-7) is a constant, it does not affect the optimization. We can therefore
define the cost functional to be minimized as

$$J(\xi) = \tfrac{1}{2}[Z - f(\xi,U)]^* (GG^*)^{-1} [Z - f(\xi,U)] + \tfrac{1}{2}[\xi - m_\xi]^* P^{-1} [\xi - m_\xi] \tag{5.4-8}$$

We have omitted the dependence of J on Z and U from the notation because it will be evaluated for specific
Z and U in application; $\xi$ is the only variable with respect to which we are optimizing. Equation (5.4-8)
makes it clear that the MAP estimator is also a least-squares estimator for this problem. The $(GG^*)^{-1}$ and
$P^{-1}$ matrices are weightings on the squared measurement error and the squared error in the prior estimate of
$\xi$, respectively.

For the maximum likelihood estimate we maximize Equation (5.4-2) instead of Equation (5.4-4). As in the
case of linear systems, the maximum likelihood estimate is equal to the limit of the MAP estimate as $P^{-1}$ goes
to zero; i.e., the last term of Equation (5.4-8) is omitted.
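
The relation between the MAP and maximum likelihood estimates can be seen on the scalar model of Example 5.4-1. The Python sketch below is an illustration under assumptions: it evaluates the cost functional of Equation (5.4-8) on a hand-chosen grid restricted to nonnegative $\xi$ (the cost is even in $\xi$ for this model, so the mirror-image minimizer is ignored), and the measured value Z = 2 and the function names are hypothetical. A crude grid search stands in for the numerical optimization methods of Chapter 2.

```python
import math


def cost(xi, z, p_inv):
    """Equation (5.4-8) for f(xi) = xi**2, GG* = 1, m_xi = 0 (scalar case)."""
    return 0.5 * (z - xi * xi) ** 2 + 0.5 * p_inv * xi * xi


def grid_minimizer(z, p_inv, lo=0.0, hi=3.0, n=30001):
    """Return the grid point in [lo, hi] with the smallest cost."""
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]
    return min(grid, key=lambda xi: cost(xi, z, p_inv))


z = 2.0
xi_map = grid_minimizer(z, p_inv=1.0)    # prior term included
xi_ml = grid_minimizer(z, p_inv=0.0)     # prior term omitted

print("MAP estimate:", xi_map)
print("ML estimate :", xi_ml)
```

Setting the derivative of the cost to zero for this model gives $\xi^2 = Z - 1/2$ for the MAP estimate and $\xi^2 = Z$ for the maximum likelihood estimate, so the prior pulls the MAP estimate toward the prior mean of zero. Because the joint density is even in $\xi$, the a posteriori expected value is zero here, which differs from both MAP minimizers.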

For a single measurement, or even for a finite number of measurements, the nonlinear MAP and MLE estimators
have none of the optimality properties discussed in Chapter 4. The estimates are neither unbiased,
minimum variance, Bayes optimal, nor efficient. When there are a large number of measurements, the differences
from optimality are usually small enough to ignore for practical purposes. The main benefits of the nonlinear
MLE and MAP estimators are their relative ease of computation and their links to the intuitively attractive
idea of least squares. These links give some reason to suspect that even if some of the assumptions about the
noise distribution are questionable, the estimators still make sense from a nonstatistical viewpoint. The
final practical judgment of an estimator is based on whether the estimates are adequate for their intended
use, rather than on whether they are exactly optimum.

The extension of Equation (5.4-8) to multiple independent experiments is straightforward.

$$J(\xi) = \tfrac{1}{2}\sum_{i=1}^{N} [Z_i - f(\xi,U_i)]^* (GG^*)^{-1} [Z_i - f(\xi,U_i)] + \tfrac{1}{2}[\xi - m_\xi]^* P^{-1} [\xi - m_\xi] \tag{5.4-9}$$

where N is the number of experiments performed. The maximum likelihood estimator is obtained by omitting
the last term. The asymptotic properties are defined as N goes to infinity. The maximum likelihood estimator
can be shown to be asymptotically unbiased and asymptotically efficient (and thus also asymptotically
minimum-variance unbiased) under quite general conditions. The estimator is also consistent. The rigorous
proofs of these properties (Cramer, 1946), although not extremely difficult, are fairly lengthy and will not
be presented here. The only condition required is that

$$\frac{1}{N}\sum_{i=1}^{N} [\nabla_\xi f(\xi,U_i)]^* (GG^*)^{-1} [\nabla_\xi f(\xi,U_i)]$$

converge to a positive definite matrix. Cramer (1946) also proves that the estimates asymptotically approach
a Gaussian distribution.

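Since the maximum likelihood estimates are asymptotically efficient, the Cramer-Rao inequality
(Equation (4.2-20)) gives a good estimate of the covariance of the estimate for large N. Using Equation (4.2-19)
for the information matrix gives

$$\begin{aligned}
M(\xi) &= \sum_{i=1}^{N} E\left\{[\nabla_\xi f(\xi,U_i)]^* (GG^*)^{-1} [Z_i - f(\xi,U_i)][Z_i - f(\xi,U_i)]^* (GG^*)^{-1} [\nabla_\xi f(\xi,U_i)]\right\} \\
&= \sum_{i=1}^{N} [\nabla_\xi f(\xi,U_i)]^* (GG^*)^{-1} E\left\{[Z_i - f(\xi,U_i)][Z_i - f(\xi,U_i)]^*\right\} (GG^*)^{-1} [\nabla_\xi f(\xi,U_i)] \\
&= \sum_{i=1}^{N} [\nabla_\xi f(\xi,U_i)]^* (GG^*)^{-1} [\nabla_\xi f(\xi,U_i)]
\end{aligned} \tag{5.4-10}$$

The covariance of the maximum likelihood estimate is thus approximated by

$$\mathrm{cov}(\hat\xi \mid \xi) \approx \left\{\sum_{i=1}^{N} [\nabla_\xi f(\xi,U_i)]^* (GG^*)^{-1} [\nabla_\xi f(\xi,U_i)]\right\}^{-1} \tag{5.4-11}$$

When $\xi$ has a prior distribution, the corresponding approximation for the covariance of the posterior distribution
of $\xi$ is

$$\mathrm{cov}(\xi \mid Z) \approx \left\{\sum_{i=1}^{N} [\nabla_\xi f(\xi,U_i)]^* (GG^*)^{-1} [\nabla_\xi f(\xi,U_i)] + P^{-1}\right\}^{-1} \tag{5.4-12}$$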

5.4.3 Computation of the Estimates 

The discussion of the previous section did not address the question of how to compute the MAP and ML
estimates. Equation (5.4-9) (without the last term for the MLE) is the cost functional to minimize. Minimization
of such nonlinear functions can be a difficult proposition, as discussed in Chapter 2.

Equation (5.4-9) is in the form of a sum of squares. Therefore the Gauss-Newton method is often the best
choice of optimization method. Chapter 2 discussed the details of the Gauss-Newton method. The probabilistic
background of Equation (5.4-9) allows us to apply the central limit theorem to strengthen one of the arguments
used to support the Gauss-Newton method.
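
As a concrete reminder of what the Gauss-Newton iteration looks like for Equation (5.4-9), the following Python sketch treats a scalar parameter. It is a minimal illustration, not the algorithm as developed in Chapter 2: the response model exp(−ξU), the noise-free synthetic data, the starting value, and all function names are assumptions made for the example. The second gradient is approximated by the retained sum-of-squares term only, and the inverse of that term is returned as the covariance approximation of Equations (5.4-11) and (5.4-12).

```python
import math


def model(xi, u):
    """Hypothetical scalar response f(xi, U) = exp(-xi * U)."""
    return math.exp(-xi * u)


def model_gradient(xi, u):
    """Derivative of the hypothetical response with respect to xi."""
    return -u * math.exp(-xi * u)


def information(xi, us, g2, p_inv):
    """Retained second-gradient term: sum of grad * (GG*)^-1 * grad, plus P^-1."""
    return sum(model_gradient(xi, u) ** 2 for u in us) / g2 + p_inv


def gauss_newton(zs, us, g2, xi0, m_xi=0.0, p_inv=0.0, iterations=20):
    """Minimize Equation (5.4-9) for a scalar xi; p_inv = 0 gives the MLE."""
    xi = xi0
    for _ in range(iterations):
        # Negative first gradient of the cost functional.
        rhs = sum(model_gradient(xi, u) * (z - model(xi, u))
                  for z, u in zip(zs, us)) / g2 - p_inv * (xi - m_xi)
        xi += rhs / information(xi, us, g2, p_inv)
    return xi, 1.0 / information(xi, us, g2, p_inv)


us = [0.5 * k for k in range(1, 11)]
xi_true = 0.7
zs = [model(xi_true, u) for u in us]      # noise-free data for a clean check

xi_hat, cov_hat = gauss_newton(zs, us, g2=0.01, xi0=0.3)
print("estimate:", xi_hat, "approximate variance:", cov_hat)
```

With noise-free data the residuals vanish at the minimum, so the omitted second-gradient term is exactly zero there and the iteration should recover the value used to generate the data.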

For simplicity, assume that all of the $U_i$ are identical. Compare the limiting behavior of the two terms
of the second gradient, as expressed by Equation (2.5-10). The term retained by the Gauss-Newton approximation
is $N[\nabla_\xi f]^* (GG^*)^{-1} [\nabla_\xi f]$, which grows linearly with N. At the true value of $\xi$, $Z_i - f(\xi,U_i)$ is a Gaussian
random variable with mean 0 and covariance GG*. Therefore, the omitted term of the second gradient is a sum
of independent, identically distributed, random variables with zero mean. By the central limit theorem, the
variance of 1/N times this term goes to zero as N goes to infinity. Since 1/N times the retained term
goes to a nonzero constant, the omitted term is small compared to the retained one for large N. This conclusion
is still true if the $U_i$ are not identical, as long as f and its gradients are bounded and the first
gradient does not converge to zero.

This demonstrates that for large N the omitted term is small compared to the retained term if $\xi$ is at
the true value, and, by continuity, if $\xi$ is sufficiently close to the true value. When $\xi$ is far from the
true value, the arguments of Chapter 2 apply.

5.4.4 Singularities

The singular cases which arise for nonlinear systems are basically the same as for linear systems and have
similar solutions. Limits as $P^{-1}$ or $(GG^*)^{-1}$ approach singular values pose no difficulty. Singular P or GG*
matrices are handled by reducing the problem to a nonsingular subproblem as in the linear case.

The one singularity which merits some additional discussion in the nonlinear case corresponds to singular

$$\sum_{i=1}^{N} C_i^* (GG^*)^{-1} C_i + P^{-1}$$

in the linear case. The equivalent matrix in the nonlinear case, if we use the Gauss-Newton algorithm, is
given by

$$\sum_{i=1}^{N} [\nabla_\xi f(\xi,U_i)]^* (GG^*)^{-1} [\nabla_\xi f(\xi,U_i)] + P^{-1} \tag{5.4-13}$$

If Equation (5.4-13) is singular at the true value, the system is said to be unidentifiable. We discussed the
computational problems of this singularity in Chapter 2. Even if the optimization algorithm correctly finds a
unique minimum, Equation (5.4-11) indicates that the covariance of a maximum likelihood estimate would be very
large. (The covariance is approximated by the inverse of a nearly singular matrix.) Thus the experimental
data contain very little information about the value of some parameter or combination of parameters. Note that
the covariance estimate is unrelated to the optimization algorithm; changes to the optimization algorithm
might help you find the minimum, but will not change the properties of the resulting estimates. The singularity
can be eliminated by using a prior distribution with a positive definite $P^{-1}$, but in this case, the estimated
parameter values will be strongly influenced by the prior distribution, since the experimental data are
lacking in information.

As with linear systems, unidentifiability is a serious problem. To obtain usable estimates, it is generally
necessary to either reformulate the problem or redesign the experiment. With nonlinear systems, we have
the additional difficulty of diagnosing whether identifiability problems are present or not. This difficulty
arises because Equation (5.4-13) is a function of $\xi$ and it is necessary to evaluate it at or near the minimum
to ascertain whether the system is identifiable. If the system is not identifiable, it may be difficult for
the algorithm to approach the (possibly nonunique) minimum because of convergence problems.

5.4.5 Partitioning

In both theory and computation, parameter estimation is much more difficult for nonlinear than for linear
systems. Therefore, means of simplifying parameter estimation problems are particularly desirable for nonlinear
systems. The partitioning ideas of Section 5.2 have this potential for some problems.

The parameter partitioning ideas of Section 5.2.3 make no linearity assumptions, and thus apply directly
to nonlinear problems. We have little more to add to the earlier discussion of parameter partitioning except
to say that parameter partitioning is often extremely important in nonlinear systems. It can make the critical
difference between a tractable and an intractable problem formulation.

Measurement partitioning, as formulated in Section 5.2.1, is impractical for most nonlinear systems. For
general nonlinear systems, the posterior density function $p(\xi \mid Z_1)$ will not be Gaussian or any other simple
form. The practical application of measurement partitioning to linear systems arises directly from the fact
that Gaussian distributions are uniquely defined by their mean and covariance.

The only practical method of applying measurement partitioning to nonlinear systems is to approximate the
function $p(\xi \mid Z_1)$ (or $p(Z_1 \mid \xi)$ for MLE estimates) by some simple form described by a few parameters. The
obvious approximation in most cases is a Gaussian density function with the same mean and covariance. The
exact covariance is difficult to compute, but Equations (5.4-11) and (5.4-12) give good approximations for this
purpose.

5.5 MULTIPLICATIVE GAUSSIAN NOISE (ESTIMATION OF VARIANCE) 

The previous sections of this chapter have assumed that the G matrix is known. The results are quite 
different when G is unknown because the noise multiplies G rather than adding to it. 

For convenience, we will work directly with GG* to avoid the necessity of taking matrix square roots.
We compute the estimates of G by taking the positive semidefinite, symmetric-matrix square roots of the
estimates of GG*.

The general form of a nonlinear system with unknown G is

$$Z = f(\xi,U) + G(\xi,U)\,\omega \tag{5.5-1}$$

We will consider N independent measurements $Z_i$ resulting from the experiments $U_i$. The $Z_i$ are then
independent Gaussian vectors with means $f(\xi,U_i)$ and covariances $G(\xi,U_i)G(\xi,U_i)^*$. We will use
Equation (5.1-3) for the prior distribution of $\xi$. Bayes rule (Equation (5.4-3)) then gives us the joint distribution
of $\xi$ and the $Z_i$ given the $U_i$. Equations (5.4-5) and (5.4-6) define the marginal distribution of
Z and the posterior distribution of $\xi$ given Z. The latter distributions are cumbersome to evaluate and
thus seldom used.

Because of the difficulty of computing the posterior distribution, the a posteriori expected value and
Bayes optimal estimators are seldom used. We can compute the maximum likelihood estimates by minimizing the
negative of the logarithm of the likelihood functional. Ignoring irrelevant constant terms, the resulting
cost functional is

$$J(\xi) = \tfrac{1}{2}\sum_{i=1}^{N} \left\{[Z_i - f(\xi)]^* [G(\xi)G(\xi)^*]^{-1} [Z_i - f(\xi)] + \ln|G(\xi)G(\xi)^*|\right\} \tag{5.5-2}$$

or equivalently

$$J(\xi) = \tfrac{1}{2}\,\mathrm{trace}\left\{[G(\xi)G(\xi)^*]^{-1} \sum_{i=1}^{N} [Z_i - f(\xi)][Z_i - f(\xi)]^*\right\} + \tfrac{1}{2} N \ln|G(\xi)G(\xi)^*| \tag{5.5-3}$$

We have omitted the explicit dependence on $U_i$ from the notation and assume that all of the $U_i$ are identical.
(The generalization to different $U_i$ is easy and changes little of essence.) The MAP estimator minimizes
a cost functional equal to Equation (5.5-2) plus the extra term $\tfrac{1}{2}[\xi - m_\xi]^* P^{-1} [\xi - m_\xi]$. The MAP estimate
of GG* is seldom used because the ML estimate is easier to compute and proves quite satisfactory.


We can use numerical methods to minimize Equation (5.5-2) and compute the ML estimates. In most practical
problems, the following parameter partitioning greatly simplifies the computation required: assume that
the $\xi$ vector can be partitioned into independent vectors $\xi_G$ and $\xi_f$ such that

$$GG^* = GG^*(\xi_G)$$

$$f = f(\xi_f) \tag{5.5-4}$$

The partition $\xi_f$ may be empty, in which case f is a constant (if $\xi_G$ is empty we have a known GG*
matrix, and the problem reduces to that discussed in the previous section). Assume further that the GG*
matrix is completely unknown, except for the restriction that it be positive semidefinite.

Set the gradients of Equation (5.5-2) with respect to GG* and $\xi_f$ equal to zero in order to find the
unconstrained minimum. Using the matrix differentiation results (A.2-5) and (A.2-6) from Appendix A, we get

$$0 = \nabla_{GG^*} J(\xi_f, GG^*) = -\tfrac{1}{2}(GG^*)^{-1} \sum_{i=1}^{N} [Z_i - f(\xi_f)][Z_i - f(\xi_f)]^* (GG^*)^{-1} + \tfrac{1}{2} N (GG^*)^{-1} \tag{5.5-5}$$

$$0 = \nabla_{\xi_f} J(\xi_f, GG^*) = -\sum_{i=1}^{N} [Z_i - f(\xi_f)]^* (GG^*)^{-1} [\nabla_{\xi_f} f(\xi_f)] \tag{5.5-6}$$

Equation (5.5-5) gives

$$\widehat{GG^*} = \frac{1}{N}\sum_{i=1}^{N} [Z_i - f(\xi_f)][Z_i - f(\xi_f)]^* \tag{5.5-7}$$

which is the familiar sample second moment of the residuals. The estimate of GG* from Equation (5.5-7) is
always positive semidefinite. It is possible for this estimate to be singular, in which case we must use the
techniques previously discussed for handling singular GG* matrices. For a given $\xi_f$, Equation (5.5-7) is a
simple noniterative estimator for GG*. This closed-form expression is the reason for the partition of $\xi$
into $\xi_f$ and $\xi_G$.

We can constrain $GG^*$ to be diagonal, in which case the solution is the diagonal elements of
Equation (5.5-7). If we place other types of constraints on $GG^*$, such as knowledge of the values of individual
off-diagonal elements, such simple closed-form solutions are not apparent. In practice, such constraints are
seldom required.

If $\xi_f$ is empty, Equation (5.5-7) is the solution to the problem. If $\xi_f$ is not empty, we need to
combine this subproblem solution with a solution for $\xi_f$ to get a solution of the entire problem. Let us
investigate the two methods discussed in Section 5.2.3.

The first method is axial iteration. Axial iteration involves successively estimating $\xi_G$ with fixed
$\xi_f$, and estimating $\xi_f$ with fixed $\xi_G$. Equation (5.5-5) gives the $\xi_G$ estimate in closed form
for fixed $\xi_f$.

To estimate $\xi_f$ with fixed $\xi_G$, we must minimize Equation (5.5-2) with respect to $\xi_f$. Unless the
system is linear, this minimization requires an iterative method. For fixed G, Equation (5.5-2) is in the form
of a sum of squares and the Gauss-Newton method is an appropriate choice (in fact this subproblem is identical
to the problem discussed in Section 5.4). We thus have an inner iteration within the outer axial iteration of
$\xi_f$ and $\xi_G$. In such situations, efficiency is often improved by terminating the inner iteration before it
converges, inasmuch as the largest changes in the $\xi_f$ estimates occur on the early inner iterations. After
these early iterations, more can be gained by revising $GG^*$ to reflect these large changes than by refining
$\xi_f$. Since the estimates of $\xi_f$ and $GG^*$ affect one another, there is no point in obtaining extremely
accurate estimates of $\xi_f$ until $GG^*$ is known to a corresponding accuracy. As Gauss (1809, p. 249) said
concerning a different problem:

It then can only be worth while to aim at the highest accuracy, when the
final correction is to be given to the orbit to be determined. But as long 
as it appears probable that new observations will give rise to new correc- 
tions, it will be convenient to relax, more or less, as the case may be from 
extreme precision, if in this way, the length of the computations can be 
considerably diminished. 

Exploiting this concept to its fullest suggests using only one iteration of the Gauss-Newton algorithm for the
inner "iteration." In this case the inner iteration is no longer iterative, and the overall algorithm would
be as follows:


1. Estimate $GG^*$ using Equation (5.5-7) and the current guess of $\xi_f$.

2. Use one iteration of the Gauss-Newton algorithm to revise the estimate of $\xi_f$.

3. Repeat steps 1 and 2 until convergence.
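
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in any particular program: the response function `f`, its Jacobian `jac`, the noise matrix `G_true`, and the starting guess are all hypothetical, and all N observations are assumed to share the same input, as in the text.

```python
import numpy as np

def f(xi):
    # Hypothetical nonlinear response: three outputs, two parameters.
    return np.array([xi[0], xi[1], xi[0] * xi[1]])

def jac(xi):
    # Analytic Jacobian of the hypothetical f with respect to xi.
    return np.array([[1.0, 0.0], [0.0, 1.0], [xi[1], xi[0]]])

def estimate_gg(Z, xi):
    # Step 1, Equation (5.5-7): sample second moment of the residuals.
    R = Z - f(xi)
    return R.T @ R / len(Z)

def gauss_newton_step(Z, xi, GG):
    # Step 2: one Gauss-Newton step for fixed GG*, using the gradient
    # (5.5-6) and the second-gradient approximation A* (GG*)^-1 A summed over samples.
    W = np.linalg.inv(GG)
    A = jac(xi)
    grad = -A.T @ W @ (Z - f(xi)).sum(axis=0)
    hess = len(Z) * A.T @ W @ A
    return xi - np.linalg.solve(hess, grad)

def axial_iteration(Z, xi0, tol=1e-9, max_iter=100):
    # Step 3: alternate the two steps until xi stops changing.
    xi = np.asarray(xi0, dtype=float)
    for _ in range(max_iter):
        xi_new = gauss_newton_step(Z, xi, estimate_gg(Z, xi))
        converged = np.linalg.norm(xi_new - xi) < tol
        xi = xi_new
        if converged:
            break
    return xi, estimate_gg(Z, xi)

# Simulated data (assumed values, for illustration only).
rng = np.random.default_rng(0)
xi_true = np.array([0.5, 2.0])
G_true = np.array([[0.1, 0.0, 0.0], [0.05, 0.2, 0.0], [0.0, 0.1, 0.15]])
Z = f(xi_true) + rng.standard_normal((500, 3)) @ G_true.T
xi_hat, GG_hat = axial_iteration(Z, xi0=[0.3, 1.5])
```

Note that the weighting matrix is recomputed before every Gauss-Newton step, so no effort is spent refining $\xi_f$ against an outdated $GG^*$.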


In general, axial iteration is a very poor algorithm, as discussed in Chapter 2. The convergence is often 
extremely slow. Furthermore, the algorithm can converge to a point that is not a strict local minimum and yet 
give no hint of a problem. For this particular application, however, the performance of axial iteration 
borders on spectacular. 


Let us consider, for a while, the alternative to axial iteration: substituting Equation (5.5-7) into
Equation (5.5-3). This substitution gives

$$J(\xi_f) = \frac{1}{2}N\,\mathrm{trace}(I) + \frac{1}{2}N\ln\left|\frac{1}{N}\sum_{i=1}^{N}[Z_i - f(\xi_f)][Z_i - f(\xi_f)]^*\right| \tag{5.5-8}$$

The first term is irrelevant to the minimization, so we will redefine the cost function as

$$J(\xi_f) = \frac{1}{2}N\ln\left|\frac{1}{N}\sum_{i=1}^{N}[Z_i - f(\xi_f)][Z_i - f(\xi_f)]^*\right| \tag{5.5-9}$$

You may sometimes see this cost function written in the equivalent (for our purposes) form

$$J(\xi_f) = |\widehat{GG^*}| \tag{5.5-10}$$


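Examine the gradient of Equation (5.5-9). Using the matrix differentiation results (A.2-3) and (A.2-6)
from Appendix A, we obtain

$$\frac{\partial}{\partial\xi_f^{(k)}}J(\xi_f) = -N\,\mathrm{tr}\left\{\left[\sum_{i=1}^{N}[Z_i - f(\xi_f)][Z_i - f(\xi_f)]^*\right]^{-1}\sum_{i=1}^{N}\left[\frac{\partial f(\xi_f)}{\partial\xi_f^{(k)}}\right][Z_i - f(\xi_f)]^*\right\} \tag{5.5-11}$$

This is more compactly expressed as

$$\nabla_{\xi_f}J(\xi_f) = -\sum_{i=1}^{N}[Z_i - f(\xi_f)]^*(\widehat{GG^*})^{-1}[\nabla_{\xi_f}f(\xi_f)] \tag{5.5-12}$$

which is exactly the same as Equation (5.5-6) evaluated at $G = \hat{G}$. Furthermore, the Gauss-Newton method used
to solve Equation (5.5-6) is a good method for solving Equation (5.5-12) because

$$\nabla^2_{\xi_f}J(\xi_f) \approx \sum_{i=1}^{N}[\nabla_{\xi_f}f(\xi_f)]^*(\widehat{GG^*})^{-1}[\nabla_{\xi_f}f(\xi_f)] \tag{5.5-13}$$

Equation (5.5-13) neglects the derivative of $\widehat{GG^*}$ with respect to $\xi_f$, but we can easily show that the
term so neglected is even smaller than the term containing $\nabla^2 f(\xi_f)$, the omission of which we previously
justified.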

Therefore, axial iteration is identical to substitution of Equation (5.5-7) as a constraint. It seems 
likely that we could use this equality to make deductions about the geometry of the cost function and thence 
about the behavior of various algorithms. (Perhaps there may be some kind of orthogonality property buried 
here.) Several computer programs, including the Iliff-Maine MMLE3 code (Maine and Iliff, 1980; and Maine,
1981), use axial iteration, or a modification thereof, often with little more justification than that it seems 
to work well. This is, of course, the final and most important justification, but it is best used as verifi- 
cation of analytical arguments. Although Equations (5.5-12) and (5.5-13) are derived in standard texts, we 
have not seen the relationship between these equations and axial iteration pursued in the literature. It is 
plain that this equivalence relates to the excellent performance of axial iteration on this problem. We will 
leave further inquiry along this line to the reader. 

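An important special case of Equation (5.5-1) occurs when $f(\xi_f)$ is linear

$$f(\xi_f) = C\xi_f \tag{5.5-14}$$

with invertible C. For linear f, Equation (5.5-6) is solved exactly in a single Gauss-Newton iteration,
and the solution is

$$\hat\xi_f = (C^*(GG^*)^{-1}C)^{-1}C^*(GG^*)^{-1}\frac{1}{N}\sum_{i=1}^{N}Z_i \tag{5.5-15}$$

If C is invertible, this reduces to

$$\hat\xi_f = C^{-1}\frac{1}{N}\sum_{i=1}^{N}Z_i \tag{5.5-16}$$

independent of $GG^*$. This is, of course, $C^{-1}$ times the sample mean. Substituting Equations (5.5-14)
and (5.5-16) into (5.5-7) gives

$$\widehat{GG^*} = \frac{1}{N}\sum_{i=1}^{N}\left[Z_i - \frac{1}{N}\sum_{j=1}^{N}Z_j\right]\left[Z_i - \frac{1}{N}\sum_{j=1}^{N}Z_j\right]^* \tag{5.5-17}$$

which is the familiar sample variance. Equation (5.5-17) can be manipulated into the alternate form

$$\widehat{GG^*} = \frac{1}{N}\sum_{i=1}^{N}Z_iZ_i^* - \frac{1}{N^2}\left(\sum_{i=1}^{N}Z_i\right)\left(\sum_{j=1}^{N}Z_j\right)^* \tag{5.5-18}$$

Because $\hat\xi_f$ is not a function of $GG^*$, the computation of $\hat\xi_f$ and $\widehat{GG^*}$ does not require
iteration for this system model.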


In general, the maximum likelihood estimates are asymptotically unbiased and efficient, but they need
have no such properties for finite N. For linear invertible systems, the biases are easy to compute. From
Equation (5.5-16),

$$E\{\hat\xi_f|\xi_f\} = C^{-1}\frac{1}{N}\sum_{i=1}^{N}C\xi_f = \xi_f \tag{5.5-19}$$

This equation shows that $\hat\xi_f$ is unbiased for finite N for linear invertible systems. From
Equation (5.5-18), using the fact that $\sum Z_i$ is Gaussian with mean $NC\xi_f$ and covariance $NGG^*$,

$$E\{\widehat{GG^*}|\xi\} = \frac{1}{N}\sum_{i=1}^{N}(C\xi_f\xi_f^*C^* + GG^*) - \frac{1}{N^2}(N^2C\xi_f\xi_f^*C^* + NGG^*) = \frac{N-1}{N}GG^* \tag{5.5-20}$$

Thus $\widehat{GG^*}$ is biased for finite N. Examining Equation (5.5-20), we see that the estimator defined by
multiplying the ML estimate by N/(N - 1) is unbiased for finite N if N > 1. This unbiased estimate is often
used instead of the maximum likelihood estimate. For large N, the difference is inconsequential.

In this discussion, we have assumed that both $GG^*$ and $\xi_f$ are unknown. If $\xi_f$ is known, then the
maximum likelihood estimator for $GG^*$ is given by Equation (5.5-7) and this estimate is unbiased. The proof is
left as an exercise. This result gives insight into the reasons for the bias of the estimator given by
Equation (5.5-17). Note that Equations (5.5-17) and (5.5-7) are identical except that the sample mean is used
in Equation (5.5-17) in place of the true mean in Equation (5.5-7). This substitution of the sample mean for
the true mean has resulted in a bias.

The difference between the estimates from Equations (5.5-17) and (5.5-7) can be written in the form

$$\left[\frac{1}{N}\sum_{i=1}^{N}Z_i - f(\xi_f)\right]\left[\frac{1}{N}\sum_{i=1}^{N}Z_i - f(\xi_f)\right]^* \tag{5.5-21}$$

As this expression shows, the estimate of $GG^*$ using the sample mean is less than or equal to the estimate
using the true mean for every realization (i.e., the difference is positive semidefinite), equality occurring
only when the sample mean of the $Z_i$ is equal to $f(\xi_f)$. This is a stronger property than the bias
difference; the bias difference implies only that the expected value using the sample mean is less.
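
The relations among Equations (5.5-7), (5.5-17), (5.5-18), and (5.5-21) are easy to confirm numerically. The following Python sketch does so on simulated data; the mean vector `mu` (standing in for $f(\xi_f)$), the matrix `G`, and the sample size are assumed values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
mu = np.array([1.0, -2.0, 0.5])            # assumed true mean, playing the role of f(xi_f)
G = np.array([[1.0, 0.0, 0.0],
              [0.3, 0.5, 0.0],
              [0.0, 0.2, 0.8]])            # assumed noise matrix
Z = mu + rng.standard_normal((N, 3)) @ G.T

zbar = Z.mean(axis=0)
gg_sample_mean = (Z - zbar).T @ (Z - zbar) / N       # Equation (5.5-17)
gg_alternate = Z.T @ Z / N - np.outer(zbar, zbar)    # Equation (5.5-18)
gg_true_mean = (Z - mu).T @ (Z - mu) / N             # Equation (5.5-7) with known mean
difference = gg_true_mean - gg_sample_mean           # should equal Equation (5.5-21)
gg_unbiased = gg_sample_mean * N / (N - 1)           # bias-corrected estimate
```

The bias-corrected estimate is the quantity that `numpy.cov` returns by default, which is one reason the N/(N - 1) factor is so often seen in practice.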


5.6 NON-GAUSSIAN NOISE 

Non-Gaussian noise is so general a classification that little can be said beyond the discussion in
Chapter 4. The forms and properties of the estimators depend strongly on the types of noise distribution.
The same comments apply to Gaussian noise if it is not additive or multiplicative, because the conditional
distribution of Z given $\xi$ is then non-Gaussian. In general, we apply the rules for transformation of
variables to derive the conditional distribution of Z given $\xi$. Using this distribution, and the prior
distribution of $\xi$ if defined, we can derive the various estimators in principle.

The optimal estimators of Chapter 4 often require considerable computation for non-Gaussian noise. It is
often possible to define much simpler estimators which have adequate performance. We will examine one
situation where such simplification can occur.

Let the system model be linear with additive noise

$$Z = C\xi + \omega \tag{5.6-1}$$

The distribution of $\omega$ must have finite mean and variance independent of $\xi$, but is otherwise unrestricted.
Call the mean $m_\omega$ and the variance $GG^*$. We will restrict ourselves to considering only linear estimators
of the form

$$\hat\xi = KZ + D \tag{5.6-2}$$

Within this class, we will look for minimum-variance, unbiased estimators. We will require that the variance
be minimized only over the class of unbiased linear estimators; there will be no guarantee that a smaller
variance cannot be attained by a nonlinear estimator.

The bias of an estimator of the form of Equation (5.6-2) is

$$b(\xi) = E\{\hat\xi|\xi\} - \xi = KC\xi - \xi + D + Km_\omega \tag{5.6-3}$$

If the estimator is to be unbiased, we must have

$$D = -Km_\omega \tag{5.6-4a}$$

$$KC = I \tag{5.6-4b}$$

The variance of an unbiased estimator of the given form is

$$\mathrm{var}(\hat\xi) = KGG^*K^* \tag{5.6-5}$$


Note that the bias and variance of the estimate depend only on the mean and variance of the noise
distribution. The exact noise distribution need not even be known. If the noise distribution were Gaussian, a
minimum-variance unbiased estimator would exist and be given by

$$\hat\xi = (C^*(GG^*)^{-1}C)^{-1}C^*(GG^*)^{-1}(Z - m_\omega) \tag{5.6-6}$$

This estimator is linear. Since no unbiased estimator, linear or not, can have a lower variance for the
Gaussian case, this estimator is the minimum-variance, unbiased linear estimator for Gaussian noise. Since
the bias and variance of a linear estimator depend only on the mean and variance of the noise, this is the
minimum-variance, unbiased linear estimator for any noise distribution with the same mean and variance.

The optimality of this estimator can also be easily proven without reference to Gaussian distributions
(although the above proof is complete and rigorous). Let

$$A = K - (C^*(GG^*)^{-1}C)^{-1}C^*(GG^*)^{-1} \tag{5.6-7}$$

for any K. Then

$$\begin{aligned}
0 \le A\,GG^*A^* &= KGG^*K^* + (C^*(GG^*)^{-1}C)^{-1}C^*(GG^*)^{-1}GG^*(GG^*)^{-1}C(C^*(GG^*)^{-1}C)^{-1} \\
&\quad - KGG^*(GG^*)^{-1}C(C^*(GG^*)^{-1}C)^{-1} - (C^*(GG^*)^{-1}C)^{-1}C^*(GG^*)^{-1}GG^*K^* \\
&= KGG^*K^* + (C^*(GG^*)^{-1}C)^{-1} - KC(C^*(GG^*)^{-1}C)^{-1} - (C^*(GG^*)^{-1}C)^{-1}C^*K^*
\end{aligned} \tag{5.6-8}$$

Using Equation (5.6-4b) as a constraint on K, Equation (5.6-8) becomes

$$0 \le KGG^*K^* - (C^*(GG^*)^{-1}C)^{-1} \tag{5.6-9}$$

or, using Equation (5.6-5)

$$\mathrm{var}(\hat\xi) \ge (C^*(GG^*)^{-1}C)^{-1} \tag{5.6-10}$$

Thus no K satisfying Equation (5.6-4b) can achieve a variance lower than that given by Equation (5.6-10).
The variance is equal to the minimum if and only if A is zero; that is, if

$$K = (C^*(GG^*)^{-1}C)^{-1}C^*(GG^*)^{-1} \tag{5.6-11}$$

Therefore Equation (5.6-6) defines the unique minimum-variance, unbiased linear estimator. We are assuming
that $GG^*$ and $C^*(GG^*)^{-1}C$ are nonsingular; Section 5.3 discusses the singular cases.

In summary, if the system is linear with additive noise, and the estimator is required to be linear and
unbiased, the results for Gaussian distributions apply to any distribution with the same mean and variance.
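
A small numerical check may help make the variance bound concrete. The Python sketch below compares the gain of Equation (5.6-11) with ordinary least squares, which is another gain satisfying KC = I. The matrices `C` and `GG` are arbitrary assumed values, and ordinary least squares is chosen only as a convenient competitor; neither comes from the text.

```python
import numpy as np

rng = np.random.default_rng(2)
C = rng.standard_normal((5, 2))        # assumed: 5 measurements, 2 parameters
A = rng.standard_normal((5, 5))
GG = A @ A.T + np.eye(5)               # assumed noise covariance, positive definite
W = np.linalg.inv(GG)

bound = np.linalg.inv(C.T @ W @ C)     # right-hand side of (5.6-10)
K_opt = bound @ C.T @ W                # optimal gain, Equation (5.6-11)
K_ols = np.linalg.inv(C.T @ C) @ C.T   # ordinary least squares: also unbiased, not optimal

def variance(K):
    # Variance of the unbiased linear estimator with gain K, Equation (5.6-5).
    return K @ GG @ K.T

excess = variance(K_ols) - bound       # positive semidefinite by (5.6-9)
```

Nothing in the check refers to the shape of the noise distribution; only its covariance enters, which is the point of the argument above.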


The use of optimal nonlinear estimators is seldom justifiable in view of the current state of the art.
Although exceptional cases exist, three factors argue against using optimal nonlinear estimators. The first
factor is the complexity and corresponding cost of deriving and implementing optimal nonlinear estimators.
For some problems, we can construct fairly simple suboptimal nonlinear estimators that give better performance
than the linear estimators (often by slightly modifying the linear estimator), but optimal nonlinear
estimation is a difficult task.

The second factor is that linear estimators, perhaps slightly modified, often can give quite good esti- 
mates, even if they are not exactly optimal. Based on the central limit theorem, several results show that, 
under fairly general conditions, the linear estimates will approach the optimal nonlinear estimates as the 
number of samples increases. The precise conditions and proofs of these results are beyond the scope of this 
book. 


The third factor is that we seldom have precise knowledge of the distribution anyway. The errors from 
inaccurate specification of the distribution are likely to be as large as the errors from using a suboptimal 
linear estimator. We need to consider this fact in deciding whether an optimal nonlinear estimator is really 
worth the cost. From Gauss (1809, p. 253):

The investigation of an orbit having, strictly speaking, the maximum probabil-
ity, will depend upon a knowledge of... [the probability distribution]; but
that depends upon so many vague and doubtful considerations - physiological
included - which cannot be subjected to calculation, that it is scarcely, and
indeed less than scarcely, possible 

CHAPTER 6


6.0 STOCHASTIC PROCESSES

In simplest terms, a stochastic process is a random variable that is a function of time. Thus stochastic 
processes are basic to the study of parameter estimation for dynamic systems. A complete and rigorous study 
of stochastic process theory requires considerable depth of mathematical background, particularly for 
continuous-time processes. For the purposes of this book, such depth of background is not required. Our 
approach does not draw heavily on stochastic process theory. 

This chapter focuses on the few results that are needed for this document. Astrom (1970), Papoulis
(1965), Lipster and Shiryayev (1977), and numerous other books give more complete treatments at varying levels
of abstraction. The necessary results in this chapter are largely concerned with continuous-time models.
Although we derive a few discrete-time equations in order to examine their continuous-time limits, the chapter
can be omitted if you are studying only discrete-time analysis.


6.1 DISCRETE TIME 

A discrete-time random process x is simply a collection of random variables $x_i$, one for each time
point, defined on the same probability space. There can be a finite or infinite number of time points. The
stochastic process is completely characterized by the joint distributions of all of the $x_i$. This can be a
rather unwieldy means of characterizing the process, however, particularly if the number of time points is
infinite.


If the $x_i$ are jointly Gaussian, the process can be characterized by its first and second moments.
Non-Gaussian processes are often also analyzed in terms of their first two moments because exact analyses are
too complicated. The first two moments of the process x are

$$m(i) = E\{x_i\} \tag{6.1-1}$$

$$R(i,j) = E\{x_ix_j^*\} \tag{6.1-2}$$

The function R(i,j) is called the autocorrelation function of the process.


A process is called stationary if the joint distribution of any collection of the $x_i$ depends only on
differences of the i values, not on the absolute time. This is called strict-sense stationarity. A process
is stationary to second order or wide-sense stationary if the first moment is constant and the second moments
depend only on time differences; i.e., if

$$R(i - k, j - k) = R(i,j) \tag{6.1-3}$$

for all i, j, and k. For Gaussian processes wide-sense stationarity implies strict-sense stationarity. The
autocorrelation function of a wide-sense stationary process can be written as a function of one variable, the
time difference.

$$R(k) = R(i, i + k) \tag{6.1-4}$$

A process is called white if $x_i$ is independent of $x_j$ for all $i \ne j$. Thus a Gaussian process is white
if R(i,j) = 0 when $i \ne j$. Any process that is not white is called colored. A white process can be
characterized by the distribution of $x_i$ for each i. If a process is both white and stationary, the distribution
of $x_i$ is the same as that of $x_j$ for all i and j, and this distribution is sufficient to characterize the
process.


6.1.1 Linear Systems Forced by Gaussian White Noise 

Our primary interest in this chapter is in the results of passing random signals through dynamic systems.
We will first look at the simplest case, stationary white Gaussian noise passing through a linear system. The
system equation is

$$x_{i+1} = \Phi x_i + Fn_i \qquad i = 0, 1, \ldots \tag{6.1-5}$$

where n is a stationary, Gaussian, white process with zero mean and identity covariance. The assumption of
zero mean is made solely to simplify the equations. Results for nonzero mean can be obtained by linear
superposition of the deterministic response to the mean and the stochastic response to the process with the mean
removed. We are also given that $x_0$ is Gaussian with mean 0 and covariance $P_0$, and that $x_0$ is independent
of the $n_i$.

The $x_i$ form a stochastic process generated from the $n_i$. We desire to examine the properties of the
stochastic process x. It is immediately obvious that x is Gaussian because $x_i$ can be written as a linear
combination of $x_0$ and $n_0, n_1, \ldots, n_{i-1}$. In fact, the joint distribution of the $x_i$ can be easily
derived by explicitly writing this linear relation and using Theorem (3.5-5). We will leave this derivation as an
exercise, and pursue instead a derivation using recursion along the lines that will be used in Chapter 7.

Assume we know that $x_i$ has mean 0 and covariance $P_i$. Then the distribution of $x_{i+1}$ follows
immediately from Equation (6.1-5):

$$E\{x_{i+1}\} = \Phi E\{x_i\} + FE\{n_i\} = 0 \tag{6.1-6}$$

$$E\{x_{i+1}x_{i+1}^*\} = \Phi E\{x_ix_i^*\}\Phi^* + FE\{n_in_i^*\}F^* + \Phi E\{x_in_i^*\}F^* + FE\{n_ix_i^*\}\Phi^* = \Phi P_i\Phi^* + FF^* \tag{6.1-7}$$

The cross terms in Equation (6.1-7) drop out because $x_i$ is a function only of $x_0$ and
$n_0, n_1, \ldots, n_{i-1}$, all of which are independent of $n_i$ by assumption. We now have a recursive formula
for the covariance of $x_i$

$$P_{i+1} = \Phi P_i\Phi^* + FF^* \qquad i = 0, 1, \ldots \tag{6.1-8}$$

where $P_0$ is a given point from which we can start the recursion.

We know that the $x_i$ are jointly Gaussian zero-mean variables with covariances given by the recursion
(6.1-8). To complete the characterization of the joint distribution of the $x_i$, we need only the
cross-covariances $E\{x_ix_j^*\}$ for $i \ne j$. Assume without loss of generality that i > j. Then $x_i$ can be
written as

$$x_i = \Phi^{i-j}x_j + \sum_{k=j}^{i-1}\Phi^{i-1-k}Fn_k \tag{6.1-9}$$

Then

$$E\{x_ix_j^*\} = \Phi^{i-j}E\{x_jx_j^*\} + \sum_{k=j}^{i-1}\Phi^{i-1-k}FE\{n_kx_j^*\} = \Phi^{i-j}P_j \qquad i \ge j \tag{6.1-10}$$

The cross terms in Equation (6.1-10) are all zero by the same reasoning as used for Equation (6.1-7). For
i < j, the same derivation (or transposition of the above result) gives

$$E\{x_ix_j^*\} = P_i(\Phi^*)^{j-i} \qquad i < j \tag{6.1-11}$$

This completes the derivation of the joint distribution of the $x_i$. Note that x is neither stationary nor
white (except in special cases).
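
The recursion (6.1-8) and the cross-covariance formulas (6.1-10) and (6.1-11) can be checked against the direct approach mentioned earlier, in which every $x_i$ is written explicitly as a linear combination of $x_0$ and the $n_k$. The Python sketch below does this for a small example; the matrices `Phi`, `F`, and `P0` and the horizon `N` are assumed values used only for illustration.

```python
import numpy as np

Phi = np.array([[0.9, 0.1], [0.0, 0.8]])   # assumed transition matrix
F = np.array([[0.5], [1.0]])               # assumed noise input matrix
P0 = np.array([[2.0, 0.3], [0.3, 1.0]])    # assumed initial covariance
N = 4

# Covariance recursion, Equation (6.1-8).
P = [P0]
for _ in range(N):
    P.append(Phi @ P[-1] @ Phi.T + F @ F.T)

def cross_cov(i, j):
    # Equations (6.1-10) and (6.1-11).
    if i >= j:
        return np.linalg.matrix_power(Phi, i - j) @ P[j]
    return P[i] @ np.linalg.matrix_power(Phi.T, j - i)

# Direct approach: stack x_0..x_N as L @ (x_0, n_0, ..., n_{N-1}).
n, q = Phi.shape[0], F.shape[1]
L = np.zeros(((N + 1) * n, n + N * q))
L[:n, :n] = np.eye(n)
for i in range(N):
    rows = slice((i + 1) * n, (i + 2) * n)
    prev = slice(i * n, (i + 1) * n)
    L[rows] = Phi @ L[prev]                       # Phi x_i
    L[rows, n + i * q:n + (i + 1) * q] += F       # + F n_i
S = np.zeros((n + N * q, n + N * q))
S[:n, :n] = P0                                    # covariance of x_0
S[n:, n:] = np.eye(N * q)                         # identity covariance of the n_k
joint = L @ S @ L.T                               # joint covariance of x_0..x_N
```

Each n-by-n block of `joint` should agree with `cross_cov(i, j)`; the recursion simply avoids forming the large stacked matrices.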


6.1.2 Nonlinear Systems and Non-Gaussian Noise 

If the noise is not Gaussian, analyzing the system becomes much more difficult. Except in special cases,
we then have to work with the probability distributions as functions instead of simply using the means and
covariances. Similar problems arise for nonlinear systems or nonadditive noise even if the noise is Gaussian,
because the distributions of the $x_i$ will not then be Gaussian.

Consider the system

$$x_{i+1} = f(x_i, n_i) \qquad i = 0, 1, \ldots \tag{6.1-12}$$

Assume that f has continuous partial derivatives almost everywhere, and can be inverted to obtain $n_i$
(trivial if the noise is additive):

$$n_i = f^{-1}(x_i, x_{i+1}) \tag{6.1-13}$$

The $n_i$ are assumed to be white and independent of $x_0$, but not necessarily Gaussian. Then the conditional
distribution of $x_{i+1}$ given $x_i$ can be obtained from Equation (3.4-1)

$$p_{x_{i+1}|x_i}(x_{i+1}|x_i) = p_{n_i}(f^{-1}(x_i, x_{i+1}))\,|\det(J)| \tag{6.1-14}$$

where J is the Jacobian of the transformation $f^{-1}$. The joint distribution of $x_0, \ldots, x_N$ can then be
obtained from

$$p_x(x_0, \ldots, x_N) = p_{x_0}(x_0)\prod_{i=1}^{N}p_{x_i|x_{i-1}}(x_i|x_{i-1}) \tag{6.1-15}$$

Equations (6.1-14) and (6.1-15) are, in general, too unwieldy to work with in practice. Practical work with
nonlinear systems or non-Gaussian noise usually involves simplifying approximations.
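
For a scalar system, Equation (6.1-14) can be evaluated directly. The Python sketch below uses a hypothetical model with multiplicative, non-Gaussian noise, $x_{i+1} = x_i\exp(n_i)$ with Laplace-distributed $n_i$; this model is an assumption chosen for illustration and does not appear in the text. The sketch checks that the resulting conditional density integrates to 1.

```python
import numpy as np

def p_noise(n):
    # Laplace density with unit scale: a simple non-Gaussian noise example.
    return 0.5 * np.exp(-np.abs(n))

def cond_density(x_next, x):
    # Hypothetical model: x_{i+1} = f(x_i, n_i) = x_i * exp(n_i), with x_i > 0.
    # Inverse, as in (6.1-13): n_i = log(x_{i+1} / x_i).
    # Jacobian of the inverse with respect to x_{i+1}: 1 / x_{i+1}.
    n = np.log(x_next / x)
    return p_noise(n) * np.abs(1.0 / x_next)      # Equation (6.1-14)

x = 2.0
grid = x * np.exp(np.linspace(-40.0, 40.0, 400001))   # log-spaced grid of x_{i+1} values
vals = cond_density(grid, x)
# Trapezoidal rule on the nonuniform grid.
total = np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(grid))
```

Even in this one-dimensional case the density is no longer of the same family as the noise density, which hints at why the vector case quickly becomes unwieldy.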


6.2 CONTINUOUS TIME 

We will look at continuous-time stochastic processes by looking at limits of discrete-time processes with
the time interval going to 0. The discussion will focus on how to take the limit so that a useful result is
obtained. We will not get involved in the intricacies of Ito or Stratonovich calculus (Astrom, 1970;
Jazwinski, 1970; and Lipster and Shiryayev, 1977).

6.2.1 Linear Systems Forced by White Noise 

Consider a linear continuous-time dynamic system driven by white, zero-mean noise

$$\dot{x}(t) = Ax(t) + F_cn(t) \tag{6.2-1}$$

We would like to look at this system as a limit (in some sense) of the discrete-time systems

$$x(t_i + \Delta) = (I + \Delta A)x(t_i) + \Delta F_cn(t_i) \tag{6.2-2}$$

as $\Delta$, the time interval between samples, goes to zero. Equation (6.2-2) is in the form of Euler's method for
approximating the solution of Equation (6.2-1). For the moment we will consider the discrete $n(t_i)$ to be
Gaussian. The distribution of the $n(t_i)$ is not particularly important to the end result, but our argument is
somewhat easier if the $n(t_i)$ are Gaussian. Equation (6.2-2) corresponds to Equation (6.1-5) with $I + \Delta A$
substituted for $\Phi$, $\Delta F_c$ substituted for F, and some changes in notation to make the discrete and
continuous notations more similar.

If n were a reasonably behaved deterministic process, we would get Equation (6.2-1) as a limit of
Equation (6.2-2) when $\Delta$ goes to zero. For the stochastic system, however, the situation is quite different.
Substituting $I + \Delta A$ for $\Phi$ and $\Delta F_c$ for F in Equation (6.1-8) gives

$$P(t_i + \Delta) = (I + \Delta A)P(t_i)(I + \Delta A)^* + \Delta^2F_cF_c^* \tag{6.2-3}$$

Subtracting $P(t_i)$ and dividing by $\Delta$ gives

$$\frac{P(t_i + \Delta) - P(t_i)}{\Delta} = AP(t_i) + P(t_i)A^* + \Delta AP(t_i)A^* + \Delta F_cF_c^* \tag{6.2-4}$$

Thus in the limit

$$\dot{P}(t) = AP(t) + P(t)A^* \tag{6.2-5}$$

Note that $F_c$ has completely dropped out of Equation (6.2-5). The distribution of x does not depend on the
distribution of the forcing noise. In particular, if $P_0 = 0$, then P(t) = 0 for all t. The system simply
does not respond to the forcing noise.

A model in which the system does not respond to the noise is not very useful. A useful model would be one
that gives a finite nonzero covariance. Such a model is achieved by multiplying the noise by $\Delta^{-1/2}$ (and
thus its covariance by $\Delta^{-1}$). We rewrite Equation (6.2-2) as

$$x(t_i + \Delta) = (I + \Delta A)x(t_i) + \Delta^{1/2}F_cn(t_i) \tag{6.2-6}$$

The $\Delta$ in the $\Delta F_cF_c^*$ term of Equation (6.2-4) then disappears and the limit becomes

$$\dot{P}(t) = AP(t) + P(t)A^* + F_cF_c^* \tag{6.2-7}$$

Note that only a $\Delta^{-1}$ behavior of the covariance (or something asymptotic to $\Delta^{-1}$) will give a
finite nonzero result in the limit.
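
The effect of the two scalings is easy to see numerically for a scalar system. The Python sketch below runs the covariance recursion (6.1-8) with the substitutions of Equation (6.2-2) and of Equation (6.2-6) and compares the results at a fixed time with the closed-form solution of Equation (6.2-7). The values of `a`, `fc`, and `T` are assumptions chosen for illustration.

```python
import math

def euler_covariance(a, fc, T, delta, noise_power):
    # Scalar covariance recursion (6.1-8) with Phi = 1 + delta*a and
    # F = delta**noise_power * fc. noise_power = 1 corresponds to (6.2-2),
    # noise_power = 0.5 corresponds to (6.2-6).
    steps = round(T / delta)
    phi = 1.0 + delta * a
    f = delta ** noise_power * fc
    p = 0.0                          # P_0 = 0: initial condition known exactly
    for _ in range(steps):
        p = phi * p * phi + f * f
    return p

a, fc, T = -1.0, 1.0, 1.0            # assumed stable scalar system and final time
# Closed-form solution of (6.2-7) with P(0) = 0 for this scalar case.
exact = fc**2 / (-2.0 * a) * (1.0 - math.exp(2.0 * a * T))

for delta in (1e-1, 1e-2, 1e-3, 1e-4):
    naive = euler_covariance(a, fc, T, delta, 1.0)
    scaled = euler_covariance(a, fc, T, delta, 0.5)
    print(f"delta={delta:g}  naive={naive:.6f}  scaled={scaled:.6f}  exact={exact:.6f}")
```

As the step size shrinks, the unscaled model's covariance at time T shrinks with it, while the rescaled model settles on the nonzero value given by Equation (6.2-7).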

We will thus define the continuous-time white-noise process in Equation (6.2-1) as a limit, in some
sense, of discrete-time processes with covariances $\Delta^{-1}$. The autocorrelation function of the
continuous-time process is

$$R(t,\tau) = E\{n(t)n(\tau)^*\} = \delta(t - \tau) \tag{6.2-8}$$

The impulse function $\delta(s)$ is zero for $s \ne 0$ and infinite for s = 0, and its integral over any finite
range including the origin is 1. We will not go through the mathematical formalism required to rigorously define
the impulse function; suffice it to say that the concept can be defined rigorously.

This model for a continuous-time white-noise process requires further discussion. It is obviously not a 
faithful representation of any physical process because the variance of n(t) is infinite at every time point. 
The total power of the process is also infinite. The response of a dynamic system to this process, however, 
appears well-behaved. 

The reasons for this apparently anomalous behavior are most easily understood in the frequency domain.
The power spectrum of the process n is flat; there is the same power in every frequency band of the same
width. There is finite power in any finite frequency range, but because the process has infinite bandwidth,
the total power is infinite. Because any physical system has finite bandwidth, the system response to the
noise will be finite. If, on the other hand, we kept the total power of the noise finite as we originally
tried to do, the power in any finite frequency band would go to zero as we approached infinite bandwidth; thus,
a physical system would have zero response.

The preceding paragraph explains why it is necessary to have infinite power in a meaningful
continuous-time white-noise process. It also suggests a rationale for justifying such a model even though any
physical noise source must have finite power. We can envision the physical noise as being band limited, but with
a band limit much larger than the system band limit. If the noise band limit is large enough, its exact value
is unimportant because the system response to inputs at a very high frequency is negligible. Therefore, we
can analyze the system with white noise of infinite bandwidth and obtain results that are very good
approximations to the finite-bandwidth results. The analysis is much simpler in the infinite-bandwidth white-noise
model (even though some fairly abstract mathematics is required to make it rigorous). In summary,
continuous-time white noise is not physically realizable but can give results that are good approximations to
physical systems.

6.2.2 Additive White Measurement Noise


We saw in the previous section that continuous-time white noise driving a dynamic system must have
infinite power in order to obtain useful results. We will show in this section that the same conclusion
applies to continuous-time white measurement noise.

We suppose that noise-corrupted measurements z are made of the system of Equation (6.2-1). The
measurement equation is assumed to be linear with additive white noise:

$$z(t) = Cx(t) + G_cn(t) \tag{6.2-9}$$

For convenience, we will assume that the mean of the noise is 0. We then ask what else must be said about
n(t) in order to obtain useful results from this model.

Presume that we have measured z(t) over the interval 0 < t < T, and we want to estimate some
characteristic of the system, say, x(T). This is a filtering problem, which we will discuss further in Chapter 7.
For current purposes, we will simplify the problem by assuming that A = 0 and $F_c$ = 0 in Equation (6.2-1). Thus
x(t) is a constant over the interval, and dynamics do not enter the problem. We can consider this a static
problem with repeated observations of a random variable, like those situations we covered in Chapter 5.

Let us look at the limit of the discrete-time equivalents to this problem. If samples are taken every
$\Delta$ seconds, there are $\Delta^{-1}T$ total samples. Equation (5.1-31) is the MAP estimator for the
discrete-time problem. The mean square error of the estimate is given by Equations (5.1-32) to (5.1-34). As
$\Delta$ decreases to 0 and the number of samples increases to infinity, the mean square error decreases to 0.
This result would imply that continuous-time estimates are always exact; it is thus not a very useful model. To
get a useful model, we must let the covariance of the measurement noise go to infinity like $\Delta^{-1}$ as
$\Delta$ decreases to 0. This argument is very similar to that used in the previous section. If the measurement
noise had finite variance, each measurement would give us a finite amount of information, and we would have an
infinite amount of information (no uncertainty) when the number of measurements was infinite. Thus the
discrete-time equivalent of Equation (6.2-9) is

$$z(t_i) = Cx(t_i) + \Delta^{-1/2}G_cn(t_i) \tag{6.2-10}$$

where $n(t_i)$ has identity covariance.
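
A scalar example shows the information argument at work. The Python sketch below uses the standard posterior variance for a Gaussian prior combined with independent Gaussian measurements of a constant; we assume, without reproducing them here, that this is what the mean square error expressions cited above reduce to in the scalar case. The prior variance, `c`, `gc`, and `T` are illustrative values.

```python
def posterior_variance(prior_var, c, noise_var, n_samples):
    # Scalar Gaussian case: the information from independent measurements
    # of a constant adds to the prior information.
    return 1.0 / (1.0 / prior_var + n_samples * c * c / noise_var)

prior_var, c, gc, T = 4.0, 1.0, 0.5, 2.0      # assumed values
# Value obtained when the noise variance grows like 1/delta, as in (6.2-10).
limit = 1.0 / (1.0 / prior_var + T * c * c / gc**2)

for delta in (1e-1, 1e-2, 1e-3, 1e-4):
    n = round(T / delta)                       # number of samples in [0, T]
    fixed = posterior_variance(prior_var, c, gc**2, n)           # finite noise variance
    scaled = posterior_variance(prior_var, c, gc**2 / delta, n)  # variance scaled by 1/delta
    print(f"delta={delta:g}  fixed={fixed:.6e}  scaled={scaled:.6e}  limit={limit:.6e}")
```

With a fixed noise variance the error variance falls toward zero as the sampling interval shrinks; with the $\Delta^{-1}$ scaling it depends only on the observation time T.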

Because any measurement is made using a physical device with a finite bandwidth, we stop getting much new
information as we take samples faster than the response time of the instrument. In fact, the measurement
equation is sometimes written as a differential equation for the instrument response instead of in the more
idealized form of Equation (6.2-9). We need a noise model with a finite power in the bandwidth of the measurements
because this is the frequency range that we are really working in. This argument is essentially the same as
the one we used in the discussion of white noise forcing the system. The white noise can again be viewed as
an approximation to band-limited noise with a large bandwidth. The lack of fidelity in representing very
high-frequency characteristics is not too important, because high frequencies will tend to be filtered out when we
operate on the data. (For instance, most operations on continuous-time data will have integrations at some
point.) As a consequence of this modeling, we should be dubious of the practical application of any algorithm
which results from this analysis and does not filter out high-frequency data in some manner.

We can generalize the conclusions in this and the previous section. Continuous-time white noise with
finite variance is generally not a useful concept in any context. We will therefore take as part of the
definition of continuous-time white noise that it have infinite covariance. We will use the spectral density
rather than the covariance as a meaningful measure of the noise amplitude. White noise with autocorrelation

$$R(t,\tau) = G_cG_c^*\delta(t - \tau) \tag{6.2-11}$$

has spectral density $G_cG_c^*$.

6.2.3 Nonlinear Systems 

As with discrete-time nonlinearities, exact analysis of nonlinear continuous-time systems is generally so
difficult as to be impossible for most practical intents and purposes. The usual approach is to use a
linearization of the system or some other approximation.

Let the system equation be

$$\dot{x}(t) = f(x,t) + g(x,t)n(t) \tag{6.2-12}$$

where n is zero-mean white noise with unity power spectral density. For compactness of notation, let p
represent the distribution of x at time t, given that x was $x_0$ at time $t_0$. The evolution of this
distribution is described by the following parabolic partial differential equation:

$$\frac{\partial p}{\partial t} = -\sum_{i=1}^{n}\frac{\partial}{\partial x_i}(pf_i) + \frac{1}{2}\sum_{i,j=1}^{n}\sum_{k}\frac{\partial^2}{\partial x_i\,\partial x_j}(pg_{ik}g_{jk}) \tag{6.2-13}$$

where n is the length of the x vector. The initial condition for this equation at $t = t_0$ is
$p = \delta(x - x_0)$. See Jazwinski (1970) for the derivation of Equation (6.2-13). This equation is called the
Fokker-Planck equation or the forward Kolmogorov equation. It is considered one of the basic equations of
nonlinear filtering theory. In principle, this equation completely describes the behavior of the system and
thus the problem is "solved." In practice, the solution of this multidimensional partial differential
equation is usually too formidable to consider seriously.

CHAPTER 7


7.0 STATE ESTIMATION FOR DYNAMIC SYSTEMS 

In this chapter, we address the estimation of the state of dynamic systems. The emphasis is on linear
dynamic systems with additive Gaussian noise. We will initially develop the theory for discrete-time systems
and then extend it to continuous-time and mixed continuous/discrete models.

The general form of a linear discrete-time system model is

$$x_{i+1} = \Phi x_i + \Psi u_i + Fn_i \qquad i = 0, 1, \ldots \tag{7.0-1a}$$

$$z_i = Cx_i + Du_i + G\eta_i \qquad i = 1, 2, \ldots \tag{7.0-1b}$$

The $n_i$ and $\eta_i$ are assumed to be independent Gaussian noise vectors with zero mean and identity covariance.
The noise n is called process noise or state noise; $\eta$ is called measurement noise. The input vectors, $u_i$,
are assumed to be known exactly. The state of the system at the ith time point is $x_i$. The initial
condition $x_0$ is a Gaussian random variable with mean $m_0$ and covariance $P_0$. ($P_0$ can be zero, meaning that
the initial condition is known exactly.)

In general, the system matrices *, f, F, C, D, and G can be functions of time. This chapter will assume 
that the system is time-invariant In order to simplify the notation. Except for the discussion of steady-state 
forms in Section 7.3, the results are easily generalized to tlme-varyloy systems by adding appropriate time 
subscripts to the matrices. 

The state estimation problem is defined as follows: based on the measurements $z_1, z_2, \ldots, z_N$, estimate the
state $x_M$. To shorten the notation, we define

$$Z_N = (z_1, z_2, \ldots, z_N)^* \tag{7.0-2}$$

State estimation problems are commonly divided into three classes, depending on the relationship of $M$ and $N$.

If $M$ is equal to $N$, the problem is called a filtering problem. Based on all of the measurements taken
up to the current time, we desire to estimate the current state. This type of problem is typical of those
encountered in real-time applications. It is the most widely treated one, and the one on which we will
concentrate.

If $M$ is greater than $N$, we have a prediction problem. The data are available up to the current time
$N$, and we desire to predict the state at some future time $M$. We will see that once the filtering problem is
solved, the prediction problem is trivial.

If $M$ is less than $N$, the problem is called a smoothing problem. This type of problem is most commonly
encountered in postexperiment batch processing in which all of the data are gathered before processing begins.
In this case, the estimate of $x_M$ can be based on all of the data gathered, both before and after time $M$.

By using all values of $M$ from 1 to $N - 1$, plus the filtered solution for $M = N$, we can construct the
estimated state time history for the interval being processed. This is referred to as fixed-interval smoothing.
Smoothing can also be used in a real-time environment where a few time points of delay in obtaining current
state estimates is an acceptable price for the improved accuracy gained. For instance, it might be acceptable
to gather data up to time $N = M + 2$ before computing the estimate of $x_M$. This is called fixed-lag
smoothing. A third type of smoothing is fixed-point smoothing; in this case, it is desired to estimate $x_M$ for a
particular fixed $M$ in a real-time environment, using new data to improve the estimate.

In all cases, $x_M$ will have a prior distribution derived from Equation (7.0-1a) and the noise
distributions. Since Equation (7.0-1) is linear in the noise, and the noise is assumed Gaussian, the prior and
posterior distributions of $x_M$ will be Gaussian. Therefore, the a posteriori expected value, MAP, and many
Bayes' minimum risk estimators will be identical. These are the obvious estimators for a problem with a
well-defined prior distribution. The remainder of the chapter assumes the use of these estimators.


7.1 EXPLICIT FORMULATION 

By manipulating Equation (7.0-1) into an appropriate form, we can write the state estimation problem as a
special case of the static estimation problem studied in Chapter 5. In this section, we will solve the problem
by such manipulation; the fact that a dynamic system is involved will thus play no special role in the meaning
of the estimation problem. We will examine only the filtering problem here.

Our aim is to manipulate the state estimation problem into the form of Equation (5.1-1). The most obvious
approach to this problem is to define the $\xi$ of Equation (5.1-1) to be $x_N$, the vector which we desire to
estimate. The observation, $Z$, would be a concatenation of $z_1, \ldots, z_N$; and the input, $U$, would be a
concatenation of $u_0, \ldots, u_N$. The noise vector, $\omega$, would then have to be a concatenation of
$n_0, \ldots, n_{N-1}, \eta_1, \ldots, \eta_N$.

The problem can indeed be written in this manner. Unfortunately, the prior distribution of $x_N$ is not
independent of $n_0, \ldots, n_{N-1}$ (except for the case $N = 0$); therefore, Equation (5.1-16) is not the correct
expression for the MAP estimate of $x_N$. Of course, we could derive an appropriate expression allowing for the
correlation, but we will take an alternate approach which allows the direct use of Equation (5.1-16).

Let the unknown parameter vector be the concatenation of the initial condition and all of the process 
noise vectors. 



$$\xi = [x_0, n_0, n_1, \ldots, n_{N-1}]^* \tag{7.1-1}$$



The vector $x_N$, which we really desire to estimate, can be written as an explicit function of the elements of
$\xi$; in particular, Equation (7.0-1a) expands into

$$x_N = \Phi^N x_0 + \sum_{i=0}^{N-1} \Phi^{N-i-1} (\Psi u_i + F n_i) \tag{7.1-2}$$

We can compute the MAP estimate of $x_N$ by using the MAP estimates of $x_0$ and $n_i$ in Equation (7.1-2). Note
that we can freely treat the $n_i$ as noise or as unknown parameters with prior distributions without changing
the essential nature of the problem. The probability distribution of $Z$ is identical in either case. The
only distinction is whether or not we want estimates of the $n_i$. For this choice of $\xi$, the remaining items
of Equation (5.1-1) must be

$$Z = [z_1, z_2, \ldots, z_N]^*, \qquad U = [u_0, u_1, \ldots, u_N]^*, \qquad \omega = [\eta_1, \eta_2, \ldots, \eta_N]^* \tag{7.1-3}$$

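We get an explicit formula for $z_i$ by substituting Equation (7.1-2) into Equation (7.0-1b), giving

$$z_i = C \Phi^i x_0 + C \sum_{j=0}^{i-1} \Phi^{i-j-1} (\Psi u_j + F n_j) + D u_i + G \eta_i \tag{7.1-4}$$

which can be written in the form of Equation (5.1-1) with

$$C(U) = \begin{bmatrix}
C\Phi & CF & 0 & \cdots & 0 & 0 \\
C\Phi^2 & C\Phi F & CF & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
C\Phi^{N-1} & C\Phi^{N-2}F & C\Phi^{N-3}F & \cdots & CF & 0 \\
C\Phi^{N} & C\Phi^{N-1}F & C\Phi^{N-2}F & \cdots & C\Phi F & CF
\end{bmatrix} \tag{7.1-5a}$$

$$D(U) = \begin{bmatrix}
C\Psi & D & 0 & \cdots & 0 & 0 \\
C\Phi\Psi & C\Psi & D & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
C\Phi^{N-2}\Psi & C\Phi^{N-3}\Psi & C\Phi^{N-4}\Psi & \cdots & D & 0 \\
C\Phi^{N-1}\Psi & C\Phi^{N-2}\Psi & C\Phi^{N-3}\Psi & \cdots & C\Psi & D
\end{bmatrix} U \tag{7.1-5b}$$

$$G(U) = \begin{bmatrix}
G & 0 & \cdots & 0 \\
0 & G & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & G
\end{bmatrix} \tag{7.1-5c}$$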


You can easily verify these matrices by substituting them into Equation (5.1-1). The mean and covariance
of the prior distribution of $\xi$ are

$$m_\xi = \begin{bmatrix} m_0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \qquad
P = \begin{bmatrix}
P_0 & 0 & \cdots & 0 \\
0 & I & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & I
\end{bmatrix}$$

The MAP estimate of $\xi$ is then given by Equation (5.1-16). The MAP estimate of $x_N$, which we seek, is
obtained from that of $\xi$ by using Equation (7.1-2).

The filtering problem is thus "solved." This solution, however, is unacceptably cumbersome. If the
system state is an $\ell$-vector, the inversion of an $(N + 1)\ell$-by-$(N + 1)\ell$ matrix is required in order to estimate
$x_N$. The computational costs become unacceptable after a very few time points. We could investigate whether it
is possible to take advantage of the structure of the matrices given in Equation (7.1-5) in order to simplify
the computation. We can more readily achieve the same ends, however, by adopting a different approach to
solving the problem from the start.


7.2 RECURSIVE FORMULATION 


To find a simpler solution to the filtering problem than that derived in the preceding section, we need to
take better advantage of the special structure of the problem. The above derivation used the linearity of the
problem and the Gaussian assumption on the noise, which are secondary features of the problem structure. The
fact that the problem involves a dynamic state-space model is much more basic, but was not used above to any
special advantage; the first step in the derivation was to recast the system in the form of a static model.

Let us reexamine the problem, making use of the properties of dynamic state-space systems.


The defining property of a state-space model is as follows: the future output is dependent only on the
current state and the future input. In other words, provided that the current state of the system is known,
knowledge of any previous states, inputs, or outputs is irrelevant to the prediction of future system
behavior; all relevant facts about previous behavior are subsumed in the knowledge of the current state. This is
essentially the definition of the state of a system. The probabilistic expression of this idea is


$$p(z_N, z_{N+1}, \ldots \mid x_N) = p(z_N, z_{N+1}, \ldots \mid x_N, x_{N-1}, \ldots, u_{N-1}, u_{N-2}, \ldots, Z_{N-1}) \tag{7.2-1}$$


It is this property that allows the system to be described in a recursive form, such as that of
Equation (7.0-1). The recursive form involves much less computation than the mathematically equivalent explicit
form of Equation (7.1-4).

This reasoning suggests that recursion might be used to some advantage in obtaining a solution to the
filtering problem. The estimators under consideration (MAP, etc.) are all defined from the conditional
distribution of $x_N$ given $Z_N$. We will seek a recursive expression for the conditional distribution, and thus
for the estimates. We will prove that such an expression exists by deriving it.

In the nature of recursive forms, we start by assuming that the conditional distribution of $x_N$ given
$Z_N$ is known for some $N$, and then we attempt to derive an expression for the conditional distribution of
$x_{N+1}$ given $Z_{N+1}$. We recognize this task as similar to the measurement partitioning of Section 5.2.2, in that
we want to simplify the solution by processing the measurements one at a time. Equations (5.2-2) and (7.2-1)
express similar ideas and give the basis for the simplifications in both cases. (The $x_N$ of Equation (7.2-1)
corresponds to the $\xi$ of Equation (5.2-2).)

Our task then is to derive $p(x_{N+1}|Z_{N+1})$. We will divide this task into two steps. First, derive
$p(x_{N+1}|Z_N)$ from $p(x_N|Z_N)$. This is called the prediction step, because we are predicting $x_{N+1}$ based on
previous information; it is also called the time update because we are updating the estimate to a new time point
based on the same data. The second step is to derive $p(x_{N+1}|Z_{N+1})$ from $p(x_{N+1}|Z_N)$. This is called the
correction step, because we are correcting the predicted estimate of $x_{N+1}$ based on the new information in
$z_{N+1}$. It is also called the measurement update because we are updating the estimate based on the new
measurement.


Since all of the distributions are assumed to be Gaussian, they are completely defined by their means and
covariance matrices. Denote the (presumed known) mean and covariance of the distribution $p(x_N|Z_N)$ by $\hat{x}_N$ and
$P_N$, respectively. In general, $\hat{x}_N$ and $P_N$ are functions of $Z_N$, but we will not encumber the notation with this
information. Likewise, denote the mean and covariance of $p(x_{N+1}|Z_N)$ by $\tilde{x}_{N+1}$ and $Q_{N+1}$. The task is thus to
derive expressions for $\tilde{x}_{N+1}$ and $Q_{N+1}$ in terms of $\hat{x}_N$ and $P_N$, and expressions for $\hat{x}_{N+1}$ and $P_{N+1}$ in terms of
$\tilde{x}_{N+1}$ and $Q_{N+1}$.

7.2.1 Prediction Step

The prediction step (time update) is straightforward. For $\tilde{x}_{N+1}$, simply take the expected value of
Equation (7.0-1a) conditioned on $Z_N$.

$$E\{x_{N+1}|Z_N\} = \Phi E\{x_N|Z_N\} + \Psi u_N + F E\{n_N|Z_N\} \tag{7.2-2}$$

The quantities $E\{x_{N+1}|Z_N\}$ and $E\{x_N|Z_N\}$ are, by definition, $\tilde{x}_{N+1}$ and $\hat{x}_N$, respectively. $Z_N$ is a function of
$x_0, n_0, \ldots, n_{N-1}, \eta_1, \ldots, \eta_N$, and deterministic quantities; $n_N$ is independent of all of these, and therefore
independent of $Z_N$. Thus

$$E\{n_N|Z_N\} = E\{n_N\} = 0 \tag{7.2-3}$$

Substituting this into Equation (7.2-2) gives

$$\tilde{x}_{N+1} = \Phi \hat{x}_N + \Psi u_N \tag{7.2-4}$$


In order to evaluate $Q_{N+1}$, take the covariance of both sides of Equation (7.0-1a). Since the three terms on
the right-hand side of the equation are independent, the covariance of their sum is the sum of their
covariances.

$$\mathrm{cov}\{x_{N+1}|Z_N\} = \Phi \, \mathrm{cov}\{x_N|Z_N\} \, \Phi^* + \mathrm{cov}\{\Psi u_N|Z_N\} + F \, \mathrm{cov}\{n_N|Z_N\} \, F^* \tag{7.2-5}$$

The terms $\mathrm{cov}\{x_{N+1}|Z_N\}$ and $\mathrm{cov}\{x_N|Z_N\}$ are, by definition, $Q_{N+1}$ and $P_N$, respectively. $\Psi u_N$ is deterministic
and, thus, has zero covariance. By the independence of $n_N$ and $Z_N$,

$$\mathrm{cov}\{n_N|Z_N\} = \mathrm{cov}\{n_N\} = I \tag{7.2-6}$$

Substituting these relationships into Equation (7.2-5) gives

$$Q_{N+1} = \Phi P_N \Phi^* + F F^* \tag{7.2-7}$$


Equations (7.2-4) and (7.2-7) constitute the results desired for the prediction step (time update) of the
filtering problem. They readily generalize to predicting more than one sample ahead. These equations justify
our earlier statement that, once the filtering problem is solved, the prediction problem is easy; for suppose
we desire to estimate $x_M$ based on $Z_N$ with $M > N$. If we can solve the filtering problem to obtain $\hat{x}_N$, the
filtered estimate of $x_N$, then, by a straightforward extension of Equation (7.2-4),

$$E\{x_M|Z_N\} = \Phi^{M-N} \hat{x}_N + \sum_{i=N}^{M-1} \Phi^{M-i-1} \Psi u_i \tag{7.2-8}$$

is the desired MAP estimate of $x_M$.
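
As an illustration of the multi-step extension, the following is a minimal sketch for a scalar system (so that $\Phi$, $\Psi$, and $F$ are plain numbers). The function names and the numerical values are hypothetical, chosen only for this example. It applies Equations (7.2-4) and (7.2-7) repeatedly and compares the resulting mean with the closed form of Equation (7.2-8).

```python
def predict_recursive(x_hat, P, phi, psi, F, inputs):
    """Apply the time update once per input, starting from the filtered
    mean x_hat and variance P. inputs holds u_N, ..., u_{M-1}."""
    x, Q = x_hat, P
    for u in inputs:
        x = phi * x + psi * u          # Equation (7.2-4)
        Q = phi * Q * phi + F * F      # Equation (7.2-7)
    return x, Q


def predict_closed_form(x_hat, phi, psi, inputs):
    """Mean of x_M given Z_N from Equation (7.2-8), with M - N = len(inputs)."""
    steps = len(inputs)
    forced = sum(phi ** (steps - k - 1) * psi * u for k, u in enumerate(inputs))
    return phi ** steps * x_hat + forced


if __name__ == "__main__":
    u = [1.0, 0.0, -1.0]               # hypothetical inputs u_N, u_{N+1}, u_{N+2}
    mean, variance = predict_recursive(1.0, 0.1, 0.9, 0.5, 0.2, u)
    print(mean, predict_closed_form(1.0, 0.9, 0.5, u), variance)
```

The predicted variance grows with each step because no measurements are used between time $N$ and time $M$.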

7.2.2 Correction Step 

For the correction step (measurement update), assume that we know the mean, $\tilde{x}_{N+1}$, and covariance, $Q_{N+1}$, of
the distribution of $x_{N+1}$ given $Z_N$. We seek the distribution of $x_{N+1}$ given both $Z_N$ and $z_{N+1}$. From
Equation (7.0-1b)

$$z_{N+1} = C x_{N+1} + D u_{N+1} + G \eta_{N+1} \tag{7.2-9}$$

The distribution of $\eta_{N+1}$ is Gaussian with zero mean and identity covariance. By the same argument as used
for $n_N$, $\eta_{N+1}$ is independent of $Z_N$. Thus, we can say that

$$p(\eta_{N+1}|Z_N) = p(\eta_{N+1}) \tag{7.2-10}$$

This trivial-looking statement is the key to the problem, for now everything in the problem is conditioned on
$Z_N$; we know the distributions of $x_{N+1}$ and $\eta_{N+1}$ conditioned on $Z_N$, and we seek the distribution of $x_{N+1}$
conditioned on $Z_N$, and additionally conditioned on $z_{N+1}$.

This problem is thus exactly in the form of Equation (5.1-1), except that all of the distributions
involved are conditioned on $Z_N$. This amounts to nothing more than restating the problem of Chapter 5 on a
different probability space, one conditioned on $Z_N$. The previous results apply directly to the new
probability space. Therefore, from Equations (5.1-14) and (5.1-15)

$$\hat{x}_{N+1} = \tilde{x}_{N+1} + P_{N+1} C^* (GG^*)^{-1} (z_{N+1} - C \tilde{x}_{N+1} - D u_{N+1}) \tag{7.2-11}$$

$$P_{N+1} = (C^* (GG^*)^{-1} C + Q_{N+1}^{-1})^{-1} \tag{7.2-12}$$

In obtaining Equations (7.2-11) and (7.2-12) from Equations (5.1-14) and (5.1-15), we have identified the
following quantities:

    (5.1-14), (5.1-15)          (7.2-11), (7.2-12)

    $m_\xi$                     $\tilde{x}_{N+1}$
    $P$                         $Q_{N+1}$
    $Z$                         $z_{N+1}$
    $C$                         $C$
    $D$                         $D u_{N+1}$
    $E\{\xi|Z\}$                $\hat{x}_{N+1}$
    $\mathrm{cov}\{\xi|Z\}$     $P_{N+1}$
    $GG^*$                      $GG^*$

This completes the derivation of the correction step (measurement update), which we see to be a direct
application of the results from Chapter 5.

7.2.3 Kalman Filter

To complete the recursive solution to the filtering problem, we need only know the solution for some value
of $N$, and we can then propagate that solution to larger $N$. The solution for $N = 0$ is immediate from the
initial problem statement. The distribution of $x_0$, conditioned on $Z_0$ (i.e., conditioned on nothing, because
$Z_i = (z_1, \ldots, z_i)^*$), is given to be Gaussian with mean $m_0$ and covariance $P_0$.

Let us now fit together the pieces derived above to show how to solve the filtering problem:

Step 1: Initialization

$$\hat{x}_0 = m_0, \qquad P_0 \text{ is given}$$

Step 2: Prediction (time update), starting with $i = 0$,

$$\tilde{x}_{i+1} = \Phi \hat{x}_i + \Psi u_i \tag{7.2-13}$$

$$\tilde{z}_{i+1} = C \tilde{x}_{i+1} + D u_{i+1} \tag{7.2-14}$$

$$Q_{i+1} = \Phi P_i \Phi^* + F F^* \tag{7.2-15}$$

Step 3: Correction (measurement update)

$$P_{i+1} = (C^* (GG^*)^{-1} C + Q_{i+1}^{-1})^{-1} \tag{7.2-16}$$

$$\hat{x}_{i+1} = \tilde{x}_{i+1} + P_{i+1} C^* (GG^*)^{-1} (z_{i+1} - \tilde{z}_{i+1}) \tag{7.2-17}$$

We have defined the quantity $\tilde{z}_{i+1}$ by Equation (7.2-14) in order to make the form of Equation (7.2-17) more
apparent; $\tilde{z}_{i+1}$ can easily be shown to be $E\{z_{i+1}|Z_i\}$. Repeat the prediction and correction steps for
$i = 0, 1, \ldots, N - 1$ in order to obtain $\hat{x}_N$, the MAP estimate of $x_N$ based on $z_1, \ldots, z_N$.

Equations (7.2-13) to (7.2-17) constitute the Kalman filter for discrete-time systems. The recursive form
of this filter is particularly suited to real-time applications. Once $\hat{x}_N$ has been computed, it is not
necessary, as it was using the methods of Section 7.1, to start from scratch in order to compute $\hat{x}_{N+1}$; we need
do only one more prediction step and one more correction step. It is extremely important to note that the
computational cost of obtaining $\hat{x}_{N+1}$ from $\hat{x}_N$ is not a function of $N$. This means that real-time Kalman
filters can be implemented using fixed finite resources to run for arbitrarily long time intervals. This was
not the case using the methods of Section 7.1, where the estimator started from scratch for each time point,
and each new estimate required more computation than the previous estimate. For some applications, it is also
important that the $P_i$ and $Q_i$ do not depend on the measurements, and can thus be precomputed. Such
precomputation can significantly reduce real-time computational requirements.

None of these advantages should obscure the fact that the Kalman filter obtains the same estimates as were
obtained in Section 7.1. The advantages of the Kalman filter lie in the easier computation of the estimates,
not in improvements in the accuracy of the estimates.
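
The recursion of Equations (7.2-13) to (7.2-17) can be sketched in a few lines for a scalar system. This is only an illustration under stated assumptions: all quantities are scalars, $G$ and every $Q_i$ are nonzero (the information form inverts both), and the function name and the indexing convention for the input lists are hypothetical.

```python
def kalman_filter_scalar(z, u, phi, psi, F, C, D, G, m0, P0):
    """Scalar Kalman filter in the information form.

    z[k] holds the measurement z_{k+1}; u[k] holds the input u_k, so
    len(u) must be len(z) + 1. Returns a list of (x_hat, P) pairs for
    time points 1, 2, ..., len(z).
    """
    x_hat, P = m0, P0                                   # Step 1: initialization
    history = []
    for i in range(len(z)):
        x_pred = phi * x_hat + psi * u[i]               # Equation (7.2-13)
        z_pred = C * x_pred + D * u[i + 1]              # Equation (7.2-14)
        Q = phi * P * phi + F * F                       # Equation (7.2-15)
        P = 1.0 / (C * C / (G * G) + 1.0 / Q)           # Equation (7.2-16)
        x_hat = x_pred + P * C / (G * G) * (z[i] - z_pred)  # Equation (7.2-17)
        history.append((x_hat, P))
    return history


if __name__ == "__main__":
    # Hypothetical data: two measurements of a lightly damped state.
    result = kalman_filter_scalar(
        z=[1.1, 0.8], u=[0.0, 0.0, 0.0],
        phi=0.9, psi=1.0, F=0.3, C=1.0, D=0.0, G=0.5, m0=0.0, P0=1.0)
    for x_hat, P in result:
        print(x_hat, P)
```

Each pass through the loop costs the same amount of work regardless of how many measurements have already been processed, and the sequence of variances does not depend on the measurement values, which is why it can be precomputed.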

7.2.4 Alternate Forms 


The filter Equations (7.2-13) to (7.2-17) can be algebraically manipulated into several equivalent
alternate forms. Although all of the variants are formally equivalent, different ones have computational advantages
in different situations. Some of the advantages lie in different points of singularity and different size
matrices to invert. We will show a few of the possible alternate forms in this section.

The first variant comes from using Equations (5.1-12) and (5.1-13) (the covariance form) instead of
(5.1-14) and (5.1-15) (the information form). Equations (7.2-16) and (7.2-17) then become

$$P_{i+1} = Q_{i+1} - Q_{i+1} C^* (C Q_{i+1} C^* + GG^*)^{-1} C Q_{i+1} \tag{7.2-18}$$

$$\hat{x}_{i+1} = \tilde{x}_{i+1} + Q_{i+1} C^* (C Q_{i+1} C^* + GG^*)^{-1} (z_{i+1} - \tilde{z}_{i+1}) \tag{7.2-19}$$

The covariance form is particularly useful if $GG^*$ or any of the $Q_i$ are singular. The exact conditions
under which $Q_i$ can become singular are fairly complicated, but we can draw some simple conclusions from
looking at Equation (7.2-15). First, if $FF^*$ is nonsingular, then $Q_i$ can never be singular. Second, a singular
$P_0$ (particularly $P_0 = 0$) is likely to cause problems if $FF^*$ is also singular. The only matrix to invert in
Equations (7.2-18) and (7.2-19) is $C Q_{i+1} C^* + GG^*$. If this matrix is singular the problem is ill-posed; the
situation is the same as that discussed in Section 5.1.3.

Note that the covariance form involves inversion of an $\ell$-by-$\ell$ matrix, where $\ell$ is the length of the
observation vector. On the other hand, the information form involves inversion of a $p$-by-$p$ matrix, where $p$
is the length of the state vector. For some systems, the difference between $\ell$ and $p$ may be significant,
resulting in a strong preference for one form or the other.
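
A small numerical check can make the equivalence of the two forms concrete. The sketch below assumes a scalar system and uses hypothetical function names and values; it evaluates the measurement update in the information form, Equations (7.2-16) and (7.2-17), and in the covariance form, Equations (7.2-18) and (7.2-19), and also shows the covariance form operating with $G = 0$, where the information form cannot be evaluated.

```python
def update_information_form(x_pred, Q, C, G, z, z_pred):
    """Equations (7.2-16) and (7.2-17); requires G != 0 and Q != 0."""
    P = 1.0 / (C * C / (G * G) + 1.0 / Q)
    x_hat = x_pred + P * C / (G * G) * (z - z_pred)
    return x_hat, P


def update_covariance_form(x_pred, Q, C, G, z, z_pred):
    """Equations (7.2-18) and (7.2-19); requires C*Q*C + G*G != 0."""
    S = C * Q * C + G * G
    P = Q - Q * C * C * Q / S
    x_hat = x_pred + Q * C / S * (z - z_pred)
    return x_hat, P


if __name__ == "__main__":
    # Hypothetical values; z_pred = C * x_pred with no input term.
    args = dict(x_pred=1.0, Q=2.0, C=3.0, z=3.6, z_pred=3.0)
    print(update_information_form(G=0.5, **args))
    print(update_covariance_form(G=0.5, **args))
    # Perfect measurement: only the covariance form can be evaluated.
    print(update_covariance_form(G=0.0, **args))
```

With $G = 0$ the covariance form returns a zero posterior variance and an estimate that reproduces the measurement exactly, which is the expected behavior for a noise-free scalar observation.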

If $G$ is diagonal (or if $GG^*$ is diagonalizable, in which case the system can be rewritten with a diagonal $G$),
Equations (7.2-18) and (7.2-19) can be manipulated into a form that involves no matrix inversions. The key to
this manipulation is to consider the system to have $\ell$ independent scalar observations at each time point
instead of a single vector observation of length $\ell$. The scalar observations can then be processed one at a
time. The Kalman filter partitions the estimation problem by processing the measurements one time point at a
time; with this modification, we extend the same partitioning concept to process one element of the measurement
vector at a time. The derivation of the measurement-update Equations (7.2-18) and (7.2-19) applies without
change to a system with several independent observations at a time point. We need only apply the
measurement-update equation $\ell$ times with no intervening time updates. We do need a little more complicated notation to
keep track of the process, but the equations are basically the same.

Let $C^{(j)}$ and $D^{(j)}$ be the $j$th rows of the $C$ and $D$ matrices, $G^{(j,j)}$ be the $j$th diagonal element of
$G$, and $z_{i+1}^{(j)}$ be the $j$th element of $z_{i+1}$. Define $\hat{x}_{i+1,j}$ to be the estimate of $x_{i+1}$ after the $j$th
scalar observation at time $i + 1$ has been processed, and define $P_{i+1,j}$ to be the covariance of $\hat{x}_{i+1,j}$.
We start the measurement update at each time point with

$$\hat{x}_{i+1,0} = \tilde{x}_{i+1} \tag{7.2-20}$$

$$P_{i+1,0} = Q_{i+1} \tag{7.2-21}$$


Then, for each scalar measurement, we do the update

$$P_{i+1,j+1} = P_{i+1,j} - P_{i+1,j} C^{(j+1)*} \left( C^{(j+1)} P_{i+1,j} C^{(j+1)*} + (G^{(j+1,j+1)})^2 \right)^{-1} C^{(j+1)} P_{i+1,j} \tag{7.2-22}$$

$$\hat{x}_{i+1,j+1} = \hat{x}_{i+1,j} + P_{i+1,j} C^{(j+1)*} \left( C^{(j+1)} P_{i+1,j} C^{(j+1)*} + (G^{(j+1,j+1)})^2 \right)^{-1} \left( z_{i+1}^{(j+1)} - \tilde{z}_{i+1,j}^{(j+1)} \right) \tag{7.2-23}$$

where

$$\tilde{z}_{i+1,j}^{(j+1)} = C^{(j+1)} \hat{x}_{i+1,j} + D^{(j+1)} u_{i+1} \tag{7.2-24}$$

Note that the inversions in Equations (7.2-22) and (7.2-23) are scalar inversions rather than matrix inversions. None
of these scalars will be 0 unless $C Q_{i+1} C^* + GG^*$ is singular. After processing all $\ell$ of the scalar
measurements for the time point, we have

$$\hat{x}_{i+1} = \hat{x}_{i+1,\ell} \tag{7.2-25}$$

$$P_{i+1} = P_{i+1,\ell} \tag{7.2-26}$$
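
The scalar-at-a-time update can be sketched as follows and compared against the vector covariance form. This is an illustration only: the helper names, the plain nested-list matrix representation, and the numerical values are assumptions made for the example, and the vector version is written for exactly two measurements so that the 2-by-2 inverse can be written out by hand. The argument `D_u` holds the already-evaluated products $D^{(j)} u_{i+1}$.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def sequential_update(x, P, C, D_u, G_diag, z):
    """Equations (7.2-20) to (7.2-26). P must be symmetric."""
    n = len(x)
    for C_row, d_u, g, z_j in zip(C, D_u, G_diag, z):
        PCt = [sum(P[r][k] * C_row[k] for k in range(n)) for r in range(n)]
        s = sum(C_row[k] * PCt[k] for k in range(n)) + g * g   # scalar to invert
        z_pred = sum(C_row[k] * x[k] for k in range(n)) + d_u  # Equation (7.2-24)
        x = [x[r] + PCt[r] * (z_j - z_pred) / s for r in range(n)]       # (7.2-23)
        P = [[P[r][c] - PCt[r] * PCt[c] / s for c in range(n)]
             for r in range(n)]                                          # (7.2-22)
    return x, P


def vector_update(x, Q, C, D_u, G_diag, z):
    """Equations (7.2-18) and (7.2-19) for exactly two measurements."""
    Ct = [list(col) for col in zip(*C)]
    S = matmul(matmul(C, Q), Ct)
    S[0][0] += G_diag[0] ** 2
    S[1][1] += G_diag[1] ** 2
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    S_inv = [[S[1][1] / det, -S[0][1] / det], [-S[1][0] / det, S[0][0] / det]]
    K = matmul(matmul(Q, Ct), S_inv)
    resid = [[z[j] - sum(c * v for c, v in zip(C[j], x)) - D_u[j]] for j in range(2)]
    dx = matmul(K, resid)
    KCQ = matmul(matmul(K, C), Q)
    n = len(x)
    x_new = [x[r] + dx[r][0] for r in range(n)]
    P_new = [[Q[r][c] - KCQ[r][c] for c in range(n)] for r in range(n)]
    return x_new, P_new


if __name__ == "__main__":
    x_pred, Q = [1.0, 0.0], [[2.0, 0.5], [0.5, 1.0]]
    C, D_u, G_diag, z = [[1.0, 0.0], [1.0, 1.0]], [0.0, 0.1], [0.5, 1.0], [1.2, 0.7]
    print(sequential_update(x_pred, Q, C, D_u, G_diag, z))
    print(vector_update(x_pred, Q, C, D_u, G_diag, z))
```

The two functions should agree to rounding error whenever the measurement noise components are independent, which is what a diagonal $G$ expresses.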


7.2.5 Innovations 

A discussion of the Kalman filter would be incomplete without some mention of the innovations. The inno- 
vation at sample point i, also called the residual, is 

$$\nu_i = z_i - \tilde{z}_i \tag{7.2-27}$$

where

$$\tilde{z}_i = E\{z_i|Z_{i-1}\} = C \tilde{x}_i + D u_i \tag{7.2-28}$$

Following the notation for $Z_i$, we define

$$V_i = [\nu_1, \nu_2, \ldots, \nu_i]^* \tag{7.2-29}$$

Now $V_i$ is a linear function of $Z_i$. This is shown by Equations (7.2-13) to (7.2-17) and (7.2-27), which give
formulae for computing the $\nu_i$ in terms of the $z_i$. It may not be immediately obvious that this function is
invertible. We will prove invertibility by writing the inverse function; i.e., by expressing $Z_i$ in terms of
$V_i$. Repeating Equations (7.2-13) and (7.2-14):

$$\tilde{x}_{i+1} = \Phi \hat{x}_i + \Psi u_i \tag{7.2-30a}$$

$$\tilde{z}_{i+1} = C \tilde{x}_{i+1} + D u_{i+1} \tag{7.2-30b}$$

Substituting Equation (7.2-27) into Equation (7.2-17) gives

$$\hat{x}_{i+1} = \tilde{x}_{i+1} + P_{i+1} C^* (GG^*)^{-1} \nu_{i+1} \tag{7.2-30c}$$

Finally, from Equation (7.2-27)

$$z_{i+1} = \tilde{z}_{i+1} + \nu_{i+1} \tag{7.2-30d}$$

Equation (7.2-30) is called the innovations form of the system. It gives the recursive formula for computing
the $z_i$ from the $\nu_i$.


Let us examine the distribution of the innovations. The innovations are obviously Gaussian, because they
are linear functions of $Z$, which is Gaussian. Using Equation (3.3-10), it is immediate that the mean of the
innovation is 0.

$$E\{\nu_i\} = E\{z_i - E\{z_i|Z_{i-1}\}\} = E\{z_i\} - E\{E\{z_i|Z_{i-1}\}\} = 0 \tag{7.2-31}$$

Derive the covariance matrix of the innovation by writing

$$\nu_i = C x_i + D u_i + G \eta_i - C \tilde{x}_i - D u_i = C(x_i - \tilde{x}_i) + G \eta_i \tag{7.2-32}$$

The two terms on the right are independent, so

$$\mathrm{cov}(\nu_i) = C \, \mathrm{cov}(x_i - \tilde{x}_i) \, C^* + GG^* = C Q_i C^* + GG^* \tag{7.2-33}$$

The most interesting property of the innovations is that $\nu_i$ is independent of $\nu_j$ for $i \neq j$. To
prove this, it is sufficient to show that $\nu_i$ is independent of $V_{i-1}$. Let us examine $E\{\nu_i|V_{i-1}\}$. Since
$V_{i-1}$ is obtained from $Z_{i-1}$ by an invertible continuous transformation, conditioning on $V_{i-1}$ is the same
as conditioning on $Z_{i-1}$ (if one is known, so is the other). Therefore,

$$E\{\nu_i|V_{i-1}\} = E\{\nu_i|Z_{i-1}\} = 0 \tag{7.2-34}$$

as shown in Equation (7.2-31). Thus we have

$$E\{\nu_i|V_{i-1}\} = E\{\nu_i\} \tag{7.2-35}$$

Comparing this equation with the formula for the Gaussian conditional mean given in Theorem (3.5-9), we see
that this can be true only if $\nu_i$ and $V_{i-1}$ are uncorrelated ($a_{12} = 0$ in the theorem). Then by
Theorem (3.5-8), $\nu_i$ and $V_{i-1}$ are independent.

The innovation is thus a discrete-time white-noise process (i.e., each time point is independent of all
of the others). Thus, the Kalman filter is often called a whitening filter; it creates a white process ($V$)
as a function of a nonwhite process ($Z$).
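
The whiteness property can be checked empirically. The following sketch simulates a scalar system with no input, runs the covariance-form filter on the simulated measurements, and normalizes each innovation by the standard deviation implied by Equation (7.2-33). The function names, the seed, and the parameter values are hypothetical; the check is statistical, so the sample statistics should be close to, not exactly equal to, the ideal values (mean 0, mean square 1, lag-one product 0).

```python
import random


def normalized_innovations(num_steps, phi, F, C, G, m0, P0, seed=0):
    """Simulate x_{i+1} = phi x_i + F n_i, z_i = C x_i + G eta_i, filter the
    data, and return the innovations divided by sqrt(C Q_i C + G^2)."""
    rng = random.Random(seed)
    x = m0 + P0 ** 0.5 * rng.gauss(0.0, 1.0)   # draw the true initial state
    x_hat, P = m0, P0
    out = []
    for _ in range(num_steps):
        x = phi * x + F * rng.gauss(0.0, 1.0)  # true state, Equation (7.0-1a)
        z = C * x + G * rng.gauss(0.0, 1.0)    # measurement, Equation (7.0-1b)
        x_pred = phi * x_hat                   # Equation (7.2-13), no input
        Q = phi * P * phi + F * F              # Equation (7.2-15)
        S = C * Q * C + G * G                  # Equation (7.2-33)
        nu = z - C * x_pred                    # Equation (7.2-27)
        out.append(nu / S ** 0.5)
        x_hat = x_pred + Q * C / S * nu        # Equation (7.2-19)
        P = Q - Q * C * C * Q / S              # Equation (7.2-18)
    return out


def whiteness_statistics(nu):
    """Sample mean, mean square, and mean lag-one product of a sequence."""
    n = len(nu)
    mean = sum(nu) / n
    mean_square = sum(v * v for v in nu) / n
    lag_one = sum(a * b for a, b in zip(nu, nu[1:])) / (n - 1)
    return mean, mean_square, lag_one


if __name__ == "__main__":
    nu = normalized_innovations(20000, phi=0.95, F=0.3, C=1.0, G=0.5, m0=0.0, P0=1.0)
    print(whiteness_statistics(nu))
```

By contrast, the raw measurements of the same system are strongly correlated from one time point to the next, since the state decays slowly for $\Phi = 0.95$.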


7.3 STEADY-STATE FORM 

The largest computational cost of the Kalman filter is in the computation of the covariance matrix $P_i$
using Equations (7.2-15) and (7.2-16) (or any of the alternate forms). For a large and important class of
problems, we can replace $P_i$ and $Q_i$ by constants $P$ and $Q$, independent of time. This approach significantly
lowers the computational cost of the filter.

We will restrict the discussion in this section to time-invariant systems; in only a few special cases 
do time-invariant filters make sense for time-varying systems. 

Equations that a time-invariant filter must satisfy are easily derived. Using Equations (7.2-18)
and (7.2-15), we can express $Q_{i+1}$ as a function of $Q_i$.

$$Q_{i+1} = \Phi [Q_i - Q_i C^* (C Q_i C^* + GG^*)^{-1} C Q_i] \Phi^* + FF^* \tag{7.3-1}$$

Thus, for $Q_i$ to equal a constant $Q$, we must have

$$Q = \Phi [Q - Q C^* (C Q C^* + GG^*)^{-1} C Q] \Phi^* + FF^* \tag{7.3-2}$$

This is the algebraic matrix Riccati equation for discrete-time systems. (An alternate form can be obtained
by using Equation (7.2-16) in place of Equation (7.2-18); the condition can also be written in terms of $P$
instead of $Q$.)

If $Q$ is a scalar, the algebraic Riccati equation is a quadratic equation in $Q$ and the solution is
simple. For nonscalar $Q$, the solution is far more difficult and has been the subject of numerous papers.

We will not cover the details of deriving and implementing numerical methods for solving the Riccati equation.
The most widely used methods are based on eigenvector decomposition (Potter, 1966; Vaughan, 1970; and Geyser
and Lehtinen, 1975). When a unique solution exists, these methods give accurate results with small
computational costs.

The derivation of the conditions under which Equation (7.3-2) has an acceptable solution is more compli- 
cated than would be appropriate for inclusion in this text. We therefore present the following result without 
proof: 


Theorem 7.3-1 If all unstable or marginally stable modes of the system are
controllable by the process noise and are observable, and if $CFF^*C^* + GG^*$
is invertible, then Equation (7.3-2) has a unique positive semidefinite
solution and $Q_i$ converges to this solution for all choices of the initial
covariance, $P_0$.

Proof See Schweppe (1973, p. 142) for a heuristic argument, or Balakrishnan
(1984) and Kailath and Ljung (1976) for more rigorous treatments.

The condition on $CFF^*C^* + GG^*$ ensures that the problem is well-posed. Without this condition, the inverse
in Equation (7.3-1) may not exist for some initial $P_0$ (particularly $P_0 = 0$). Some statements of the theorem
incorporate the stronger requirement that $GG^*$ be invertible, but the weaker condition is sufficient. Perhaps
the most important point to note is that the system is not required to be stable. Although the existence and
uniqueness of the solution are easier to prove for stable systems, the more general conditions of
Theorem (7.3-1) are important in the estimation and control of unstable systems.

We can achieve a heuristic understanding of the need for the conditions of Theorem (7.3-1) by examining
one-dimensional systems, for which we can write the solutions to Equation (7.3-2) explicitly. If the system
is one-dimensional, then it is observable if $C$ is nonzero (and $G$ is finite), and it is controllable by the
process noise if $F$ is nonzero. We will consider the problem in several cases.

Case 1: $G = 0$. In this case, we must have $C \neq 0$ and $F \neq 0$ in order for the problem to be well-posed.
Equation (7.3-1) then reduces to $Q_{i+1} = FF^*$, giving a unique time-invariant covariance satisfying
Equation (7.3-2).

Case 2: $G \neq 0$, $C = 0$, $F = 0$. In this case, Equation (7.3-1) becomes $Q_{i+1} = \Phi^2 Q_i$. This converges to
$Q = 0$ if $|\Phi| < 1$ (stable system). If $|\Phi| = 1$, $Q_i$ remains at the starting value, and thus the steady-state
covariance is not unique. If $|\Phi| > 1$, the solution diverges or stays at 0, depending on the starting value.

Case 3: $G \neq 0$, $C = 0$, $F \neq 0$. In this case, Equation (7.3-2) reduces to

$$Q = \Phi^2 Q + F^2 \tag{7.3-3}$$

For $|\Phi| < 1$, this equation has a unique, nonnegative solution

$$Q = \frac{F^2}{1 - \Phi^2} \tag{7.3-4}$$

and convergence of Equation (7.3-1) to this solution is easily shown. If $|\Phi| \geq 1$, the solution is negative,
which is not an admissible covariance, or infinite; in either event, Equation (7.3-1) diverges to infinity.

Case 4: $G \neq 0$, $C \neq 0$, $F = 0$. In this case, Equation (7.3-2) is a quadratic equation with roots zero and
$(\Phi^2 - 1)G^2/C^2$. If $|\Phi| < 1$, the second root is negative, and thus there is a unique nonnegative root. If
$|\Phi| = 1$, there is a double root at zero, and the solution is still unique. In both of these events,
convergence of Equation (7.3-1) to the solution at 0 is easy to show. If $|\Phi| > 1$, there are two nonnegative roots,
and the system can converge to either one, depending on whether or not the initial covariance is zero.

Case 5: $G \neq 0$, $C \neq 0$, $F \neq 0$. In this case, Equation (7.3-2) is a quadratic equation with roots

$$Q = \tfrac{1}{2} H \pm \sqrt{\tfrac{1}{4} H^2 + F^2 G^2 / C^2} \tag{7.3-5}$$

where

$$H = F^2 + (\Phi^2 - 1) G^2 / C^2 \tag{7.3-6}$$

Regardless of the value of $\Phi$, the square-root term is always larger in magnitude than $(1/2)H$; therefore, there
is one positive and one negative root. Convergence of Equation (7.3-1) to the positive root is easy to show.
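
Case 5 lends itself to a quick numerical check. The sketch below, with hypothetical function names and parameter values, iterates the scalar form of Equation (7.3-1) from two different starting covariances and compares the result with the positive root of Equation (7.3-5). An unstable system ($|\Phi| > 1$) is used deliberately, since the convergence does not require stability.

```python
import math


def riccati_step(Q, phi, F, C, G):
    """One pass of Equation (7.3-1) for a scalar system."""
    return phi * (Q - Q * C * C * Q / (C * Q * C + G * G)) * phi + F * F


def iterate_riccati(Q0, phi, F, C, G, steps=200):
    """Repeat the recursion from the starting covariance Q0."""
    Q = Q0
    for _ in range(steps):
        Q = riccati_step(Q, phi, F, C, G)
    return Q


def positive_root(phi, F, C, G):
    """Positive root of Equation (7.3-5), with H from Equation (7.3-6)."""
    H = F * F + (phi * phi - 1.0) * G * G / (C * C)
    return 0.5 * H + math.sqrt(0.25 * H * H + F * F * G * G / (C * C))


if __name__ == "__main__":
    params = dict(phi=1.5, F=0.4, C=1.0, G=0.7)   # hypothetical unstable system
    print(positive_root(**params))
    print(iterate_riccati(0.0, **params), iterate_riccati(10.0, **params))
```

Both starting values lead to the same limit, consistent with the uniqueness claimed for this case.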

Let us now summarize the results of these five cases. In all well-posed cases, the covariance converges
to a unique value if the system is stable. For unstable or marginally stable systems, a unique converged value
is assured if both $C$ and $F$ are nonzero. For one-dimensional systems, there is also a unique convergent
solution for $|\Phi| = 1$, $G \neq 0$, $C \neq 0$, $F = 0$; this case illustrates that the conditions of Theorem (7.3-1) are not
necessary, although they are sufficient.

Heuristically, we can say that observability ($C \neq 0$) prevents the covariance from diverging to infinity
for unstable systems. Controllability by the process noise ($F \neq 0$) ensures uniqueness by eliminating the
possibility of perfect prediction ($Q = 0$).

An important related question to consider is the stability of the filter. We define the corrected error
vector to be

$$e_i = x_i - \hat{x}_i \tag{7.3-7}$$

Using Equations (7.0-1), (7.2-15), (7.2-16), and (7.2-19) gives the recursive relationship

$$e_{i+1} = (I - KC) \Phi e_i + (I - KC) F n_i - K G \eta_{i+1} \tag{7.3-8}$$

where

$$K = P C^* (GG^*)^{-1} = Q C^* (C Q C^* + GG^*)^{-1} \tag{7.3-9}$$

We can show that, given the conditions of Theorem (7.3-1), the system of Equation (7.3-8) is stable. This
stability implies that, in the absence of new disturbances (noise), errors in the state estimate will die out
with time; furthermore, for bounded disturbances, the errors will always be bounded. A rigorous proof is not
presented here.
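
For a one-dimensional system the stability claim can be illustrated directly, because the homogeneous part of Equation (7.3-8) is governed by the single number $(1 - KC)\Phi$. The sketch below is an illustration under stated assumptions, not a proof: it takes a scalar system with $G$, $C$, and $F$ all nonzero, uses the positive root of Equation (7.3-5) for $Q$, and evaluates the gain from Equation (7.3-9). The function names and values are hypothetical.

```python
import math


def steady_state_gain(phi, F, C, G):
    """Gain K from Equation (7.3-9), using the positive root of (7.3-5)."""
    H = F * F + (phi * phi - 1.0) * G * G / (C * C)
    Q = 0.5 * H + math.sqrt(0.25 * H * H + F * F * G * G / (C * C))
    return Q * C / (C * Q * C + G * G)


def error_pole(phi, F, C, G):
    """Coefficient multiplying e_i in Equation (7.3-8) for a scalar system."""
    return (1.0 - steady_state_gain(phi, F, C, G) * C) * phi


if __name__ == "__main__":
    for phi in (0.5, 1.0, 1.5, 3.0):       # stable, marginal, and unstable plants
        pole = error_pole(phi, F=0.4, C=1.0, G=0.7)
        # With no new noise, an initial error e_0 decays as pole**k * e_0.
        print(phi, pole, pole ** 10)
```

In each case the magnitude of the coefficient is below one even though the plant itself may be unstable, so an initial estimation error shrinks geometrically when no new noise enters.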

It is interesting to examine the stability of the one-dimensional example with $G \neq 0$, $C \neq 0$, $F = 0$, and
$|\Phi| = 1$. We previously noted that $Q_i$ for this case converges to 0 for all initial covariances. Let us
examine the steady-state filter. For this case, $K = 0$ and Equation (7.3-8) reduces to

$$e_{i+1} = \Phi e_i \tag{7.3-10}$$

which is only marginally stable. Recall that this case did not meet the conditions of Theorem (7.3-1), so our
stability guarantee does not apply. Although a steady-state filter exists, it does not perform at all like the
time-varying filter. The time-varying filter reduces the error to zero asymptotically with time. The
steady-state filter has no feedback, and the magnitude of the error remains at its initial value. Balakrishnan (1984)
discusses the steady-state filter in more detail.

Two special cases of time-invariant Kalman filters deserve special note. The first case is where $F$ is
zero and the system is stable (and $GG^*$ must be invertible to ensure a well-posed problem). In this case, the
steady-state Kalman gain $K$ is zero. The Kalman filter simply integrates the state equation, ignoring any
available measurements. Since the system is stable and has no disturbances, the error will decay to zero.

The same filter is obtained for nonzero $F$ if $C$ is zero or if $G$ is infinite. The error does not then
decay to zero, but the output contains no useful information to feed back.

The second special case is where G is zero and C is square and invertible. $FF^*$ must be invertible
to ensure a well-posed problem. For this case, the Kalman gain is $C^{-1}$. The estimator then reduces to

$$ \hat{x}_i = C^{-1}(z_i - Du_i) \tag{7.3-11} $$

which ignores all previous information. The current state can be reconstructed exactly from the current
measurement, so there is no need to consider past data. This is the antithesis of the case where F is 0 and no
information from the current measurement is used. Most realistic systems lie somewhere between these two
extremes.


7.4 CONTINUOUS TIME 

The form of a linear continuous-time system model is 

$$ \dot{x}(t) = Ax(t) + Bu(t) + F_c n(t) \tag{7.4-1a} $$

$$ z(t) = Cx(t) + Du(t) + G_c\eta(t) \tag{7.4-1b} $$

where n and $\eta$ are assumed to be zero-mean white-noise processes with unity power spectral density. The
input u is assumed to be known exactly. As in the discrete-time analysis, we will simplify the notation by 
assuming that the system is time invariant. The same derivation applies to time-varying systems by evaluating 
the matrices at the appropriate time points. 

We will analyze Equation (7.4-1) as a limit of the discrete-time systems 

$$ x(t_i + \Delta) = (I + \Delta A)x(t_i) + \Delta Bu(t_i) + \Delta^{1/2}F_c n(t_i) \tag{7.4-2a} $$

$$ z(t_i) = Cx(t_i) + Du(t_i) + \Delta^{-1/2}G_c\eta(t_i) \tag{7.4-2b} $$

where n and $\eta$ are discrete-time white-noise processes with identity covariances. The reasons for the $\Delta^{1/2}$
factors were discussed in Section 6.2.


The filter for the system of Equation (7.4-2) is obtained by making appropriate substitutions in
Equations (7.2-13) to (7.2-17). We need to substitute $(I + \Delta A)$ in place of $\Phi$, $\Delta B$ in place of $\Psi$,
$\Delta F_cF_c^*$ in place of $FF^*$, and $\Delta^{-1}G_cG_c^*$ in place of $GG^*$. Combining Equations (7.2-13), (7.2-14),
and (7.2-17) and making the substitutions gives

$$ \hat{x}(t_i + \Delta) = (I + \Delta A)\hat{x}(t_i) + \Delta Bu(t_i) + \Delta P(t_i + \Delta)C^*(G_cG_c^*)^{-1}[z(t_i + \Delta) - C(I + \Delta A)\hat{x}(t_i) - C\Delta Bu(t_i) - Du(t_i + \Delta)] \tag{7.4-3} $$

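Subtracting $\hat{x}(t_i)$ and dividing by $\Delta$ gives

$$ \frac{\hat{x}(t_i + \Delta) - \hat{x}(t_i)}{\Delta} = A\hat{x}(t_i) + Bu(t_i) + P(t_i + \Delta)C^*(G_cG_c^*)^{-1}[z(t_i + \Delta) - C(I + \Delta A)\hat{x}(t_i) - C\Delta Bu(t_i) - Du(t_i + \Delta)] \tag{7.4-4} $$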


Taking the limit as $\Delta \to 0$ gives the filter equation

$$ \dot{\hat{x}}(t) = A\hat{x}(t) + Bu(t) + P(t)C^*(G_cG_c^*)^{-1}[z(t) - C\hat{x}(t) - Du(t)] \tag{7.4-5} $$

It remains to find the equation for P(t). First note that Equation (7.2-15) becomes

$$ Q(t_i + \Delta) = (I + \Delta A)P(t_i)(I + \Delta A)^* + \Delta F_cF_c^* \tag{7.4-6} $$

and thus

$$ \lim_{\Delta \to 0} Q(t_i + \Delta) = P(t_i) \tag{7.4-7} $$


Equation (7.2-18) is a more convenient form for our current purposes than (7.2-16). Make the appropriate
substitutions in Equation (7.2-18) to get

$$ P(t_i + \Delta) = Q(t_i + \Delta) - Q(t_i + \Delta)C^*(CQ(t_i + \Delta)C^* + \Delta^{-1}G_cG_c^*)^{-1}CQ(t_i + \Delta) \tag{7.4-8} $$

Subtract $P(t_i)$ and divide by $\Delta$ to give

$$ \frac{P(t_i + \Delta) - P(t_i)}{\Delta} = \frac{Q(t_i + \Delta) - P(t_i)}{\Delta} - Q(t_i + \Delta)C^*(\Delta CQ(t_i + \Delta)C^* + G_cG_c^*)^{-1}CQ(t_i + \Delta) \tag{7.4-9} $$

For the first term on the right of Equation (7.4-9), substitute from Equation (7.4-6) to get

$$ \frac{Q(t_i + \Delta) - P(t_i)}{\Delta} = AP(t_i) + P(t_i)A^* + \Delta AP(t_i)A^* + F_cF_c^* \tag{7.4-10} $$

Thus in the limit Equation (7.4-9) becomes

$$ \dot{P}(t) = AP(t) + P(t)A^* + F_cF_c^* - P(t)C^*(G_cG_c^*)^{-1}CP(t) \tag{7.4-11} $$

Equation (7.4-11) is the continuous-time Riccati equation. The initial condition for the equation is $P(0) = P_0$,
the covariance of the initial state. $P_0$ is assumed to be known. Equations (7.4-5) and (7.4-11) constitute
the solution to the continuous-time filtering problem for linear systems with white process and measurement
noise. The continuous-time filter requires $G_cG_c^*$ to be nonsingular.

One point worth noting about the continuous-time filter is that the innovation $z(t) - \hat{z}(t)$, where
$\hat{z}(t) = C\hat{x}(t) + Du(t)$, is a white-noise process with the same power spectral density as the
measurement noise. (They are not, however, the same process.) The power spectrum of the innovation can be found
by looking at the limit of Equation (7.2-33). Making the appropriate substitutions gives

$$ \mathrm{cov}(\nu(t_i)) = CQ(t_i)C^* + \Delta^{-1}G_cG_c^* \tag{7.4-12} $$

The power spectral density of the innovation is then

$$ \lim_{\Delta \to 0} \Delta\,\mathrm{cov}(\nu(t_i)) = G_cG_c^* \tag{7.4-13} $$

The disappearance of the first term of Equation (7.4-12) in the limit makes the continuous-time filter simpler
than the discrete-time one in many ways.

For time-invariant continuous-time systems, we can investigate the possibility that the filter reaches a
steady state. As in the discrete-time steady-state filter, this outcome would result in a significant
computational advantage. If the steady-state filter exists, the steady-state P(t) must satisfy
the equation

$$ AP + PA^* + F_cF_c^* - PC^*(G_cG_c^*)^{-1}CP = 0 \tag{7.4-14} $$

obtained by setting $\dot{P}$ to 0 in Equation (7.4-11). The eigenvector decomposition methods referenced after
Equation (7.3-2) are also the best practical numerical methods for solving Equation (7.4-14). The following
theorem, comparable to Theorem (7.3-1), is not proven here.

Theorem 7.4-1 If all unstable or neutrally stable modes of the system are
controllable by the process noise and are observable, and if $G_cG_c^*$ is
invertible, then Equation (7.4-14) has a unique positive semidefinite
solution, and P(t) converges to this solution for all choices of the initial
covariance $P_0$.

Proof See Kailath and Ljung (1976), Balakrishnan (1981), or Kalman and
Bucy (1961).
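The convergence claimed by Theorem 7.4-1 is easy to observe in the scalar case. The following sketch integrates the scalar form of Equation (7.4-11) with a simple fixed-step Euler scheme and compares the result with the nonnegative root of the scalar form of Equation (7.4-14). The function names, the integration scheme, and the numerical values are illustrative assumptions; practical matrix problems would use the eigenvector decomposition methods mentioned above.

```python
import math


def riccati_rate(p, a, fc, c, gc):
    """Right-hand side of the scalar form of Equation (7.4-11)."""
    return 2.0 * a * p + fc * fc - (p * c) ** 2 / (gc * gc)


def integrate_riccati(p0, a, fc, c, gc, dt, t_end):
    """Integrate the scalar Riccati equation with fixed-step Euler integration."""
    p = p0
    for _ in range(int(round(t_end / dt))):
        p += dt * riccati_rate(p, a, fc, c, gc)
    return p


def steady_state_p(a, fc, c, gc):
    """Nonnegative root of the scalar form of Equation (7.4-14)."""
    r = (gc / c) ** 2
    return r * (a + math.sqrt(a * a + fc * fc / r))


# Unstable scalar system (a > 0) that is observable and driven by process noise.
a, fc, c, gc = 1.0, 1.0, 1.0, 0.5
p_ss = steady_state_p(a, fc, c, gc)
p_from_zero = integrate_riccati(0.0, a, fc, c, gc, dt=0.001, t_end=10.0)
p_from_large = integrate_riccati(5.0, a, fc, c, gc, dt=0.001, t_end=10.0)
print(p_ss, p_from_zero, p_from_large)   # all three values agree closely
```

Both initial covariances lead to the same steady-state value, even though the open-loop system is unstable.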


7.5 CONTINUOUS/DISCRETE TIME 

Many practical applications of filtering involve discrete sampled measurements of systems with
continuous-time dynamics. Since this problem has elements of both discrete and continuous time, there is often debate
over whether the discrete- or continuous-time filter is more appropriate. In fact, neither of these filters
is appropriate because they are both based on models that are not realistic representations of the true system.
As Schweppe (1973, p. 206) says,

Some rather interesting arguments sometimes result when one asks the question,
Are the discrete- or the continuous-time results more useful? The answer is,
of course, that the question is stupid ... neither is superior in all cases.

The appropriate model for a continuous-time dynamic system with discrete-time measurements is a continuous-time
model with discrete-time measurements. Although this statement sounds like a tautology, its point has been
missed enough to make it worth emphasizing. Some of the confusion may be due to the mistaken impression that
such a mixed model could not be analyzed with the available tools. In fact, the derivation of the appropriate
filter is trivial, given the pure continuous- and pure discrete-time results. The filter for this class of
problems simply involves an appropriate combination of the discrete- and continuous-time filters previously
derived. It takes only a few lines to show how the previously derived results fit this problem. We will spend
most of this section talking about implementation issues in a little more detail.

Let the system be described by 

$$ \dot{x}(t) = Ax(t) + Bu(t) + F_c n(t) \tag{7.5-1a} $$

$$ z(t_i) = Cx(t_i) + Du(t_i) + G\eta(t_i) \qquad i = 1,2,\ldots \tag{7.5-1b} $$

Equation (7.5-1a) is identical to Equation (7.4-1a); and, except for a notation change, Equation (7.5-1b) is
identical to Equation (7.0-1b). Note that the observation is only defined at the discrete points $t_i$,
although the state is defined in continuous time.


Between the times of two observations, the analysis of Equation (7.5-1) is identical to that of
Equation (7.4-1) with an infinite G matrix or a zero C matrix; either of these conditions is equivalent to
having no useful observation. Let $\hat{x}(t_i)$ be the state estimate at time $t_i$ based on the observations up to
and including $z(t_i)$. Then the predicted estimate in the interval $(t_i, t_i + \Delta]$ is obtained from

$$ \tilde{x}(t_i^+) = \hat{x}(t_i) \tag{7.5-2} $$

$$ \dot{\tilde{x}}(t) = A\tilde{x}(t) + Bu(t) \tag{7.5-3} $$

The covariance of the prediction is

$$ Q(t_i^+) = P(t_i) \tag{7.5-4} $$

$$ \dot{Q}(t) = AQ(t) + Q(t)A^* + F_cF_c^* \tag{7.5-5} $$

Equations (7.5-3) and (7.5-5) are obtained directly by substituting C = 0 in Equations (7.4-5) and (7.4-11).
The notation has been changed to indicate that, because there is no observation in the interval, these are
predicted estimates; whereas, in the pure continuous-time filter, the observations are continuously used and
filtered estimates are obtained. Integrate Equations (7.5-3) and (7.5-5) over the interval $(t_i, t_i + \Delta)$ to
obtain the predicted estimate $\tilde{x}(t_i + \Delta)$ and its covariance $Q(t_i + \Delta)$.

In practice, although u(t) is defined continuously, it will often be measured (or otherwise known) only
at the time points $t_i$. Furthermore, the integration will likely be done by a digital computer, which cannot
integrate continuous-time data exactly. Thus Equation (7.5-3) will be integrated numerically. The simplest
integration approximation would give

$$ \tilde{x}(t_i + \Delta) \approx (I + \Delta A)\tilde{x}(t_i^+) + \Delta Bu(t_i) \tag{7.5-6} $$

This approximation may be adequate for some purposes, but it is more often a little too crude. If the
A matrix is time-varying, there are several reasonable integration schemes which we will not discuss here;
the most common are based on Runge-Kutta algorithms (Acton, 1970). For systems with time-invariant A
matrices and constant sample intervals, the transition matrix is by far the most efficient approach. First
define

$$ \Phi = \exp(A\Delta) \tag{7.5-7} $$

$$ \Psi = \int_0^{\Delta} \exp(A\tau)\,d\tau\, B \tag{7.5-8} $$

Then

$$ \tilde{x}(t_i + \Delta) \approx \Phi\tilde{x}(t_i^+) + \Psi u(t_i) \tag{7.5-9} $$

This approximation is the exact solution to Equation (7.5-3) if u(t) holds its value between samples.
Wiberg (1971) and Zadeh and Desoer (1963) derive this solution. Moler and Van Loan (1978) discuss various
means of numerically evaluating Equations (7.5-7) and (7.5-8). Equation (7.5-9) has the advantage of being
in the exact form in which discrete-time systems are usually written (Equation (7.0-1a)).

Equation (7.5-9) introduces about 1/2-sample delay in the modeling of the response to the control input
unless the continuous-time u(t) holds its value between samples; this delay is often unacceptable.
Figure (7.5-1) shows a sample input signal and the signal as modeled by Equation (7.5-9). A better
approximation is usually

$$ \tilde{x}(t_i + \Delta) \approx \Phi\tilde{x}(t_i^+) + \tfrac{1}{2}\Psi\,(u(t_i) + u(t_i + \Delta)) \tag{7.5-10} $$

This equation models u(t) between samples as being constant at the average of the two sample values.
Figure (7.5-2) illustrates this model. There is little phase lag in the model represented by Equation (7.5-10),
and the difference in implementation cost between Equations (7.5-9) and (7.5-10) is negligible.
Equation (7.5-10) is probably the most commonly used approximation method with time-invariant A matrices.
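The half-sample delay of Equation (7.5-9) and its removal by Equation (7.5-10) can be seen in a small numerical experiment. The sketch below uses a scalar system with a ramp input, for which the exact response is available in closed form. The function names, the choice of a ramp, and the numerical values are illustrative assumptions.

```python
import math


def simulate(a, b, dt, steps, u, scheme):
    """Propagate x' = a*x + b*u(t) from x(0) = 0 using a chosen input model.

    scheme "hold"    : Equation (7.5-9), input held at u(t_i)
    scheme "average" : Equation (7.5-10), input held at the two-sample average
    """
    phi = math.exp(a * dt)           # Equation (7.5-7), scalar case
    psi = (phi - 1.0) / a * b        # Equation (7.5-8), scalar case, a != 0
    x = 0.0
    for i in range(steps):
        t = i * dt
        if scheme == "hold":
            x = phi * x + psi * u(t)
        elif scheme == "average":
            x = phi * x + 0.5 * psi * (u(t) + u(t + dt))
        else:
            raise ValueError("unknown scheme: " + scheme)
    return x


def ramp(t):
    """Input that changes between samples, so neither model is exact."""
    return t


a, b, dt, steps = -1.0, 1.0, 0.1, 50
t_end = steps * dt
# Closed-form response of x' = a*x + b*t with x(0) = 0.
exact = b * ((math.exp(a * t_end) - 1.0) / a ** 2 - t_end / a)

errors = {s: abs(simulate(a, b, dt, steps, ramp, s) - exact) for s in ("hold", "average")}
print(errors)   # the "hold" error is about half a sample of ramp; "average" is far smaller
```

For this input, the error of the held-input model is close to the ramp slope times half the sample interval, which is the delay described above.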

The high-frequency content introduced by the jumps in the above models can be removed by modeling u(t) as
a linear interpolation between the measured values, as illustrated in Figure (7.5-3). This model adds another
term to Equation (7.5-10) proportional to $u(t_i + \Delta) - u(t_i)$. In our experience, this degree of fidelity is
usually unnecessary, and is not worth the extra cost and complication. There are some applications where the
accuracy required might justify this or even more complicated methods, such as higher-order spline fits. (The
linear interpolation is a first-order spline.)

If you are using a Runge-Kutta algorithm instead of a transition-matrix algorithm for solving the
differential equation, linear interpolation of the input introduces negligible extra cost and is common practice.

Equation (7.5-5) does not involve measured data and thus does not present the problems of interpolating
between the measurements. The exact solution of Equation (7.5-5) is

$$ Q(t_i + \Delta) = \Phi Q(t_i^+)\Phi^* + \int_0^{\Delta} \exp(A(\Delta - \tau))F_cF_c^*\exp(A^*(\Delta - \tau))\,d\tau \tag{7.5-11} $$

as can be verified by substitution. Note that Equation (7.5-11) is exactly in the form of a discrete-time
update of the covariance (Equation (7.2-15)) if F is defined as a square root of the integral term. For
small $\Delta$, the integral term is well approximated by $\Delta F_cF_c^*$, resulting in

$$ Q(t_i + \Delta) \approx \Phi Q(t_i^+)\Phi^* + \Delta F_cF_c^* \tag{7.5-12} $$

The errors in this approximation are usually far smaller than the uncertainty in the value of $F_c$, and can thus
be neglected. This approximation is significantly better than the alternate approximation

$$ Q(t_i + \Delta) \approx Q(t_i^+) + \Delta AQ(t_i^+) + \Delta Q(t_i^+)A^* + \Delta F_cF_c^* \tag{7.5-13} $$

obtained by inspection from Equation (7.5-5).
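The size of the error made in replacing the integral term of Equation (7.5-11) by $\Delta F_cF_c^*$ can be examined directly in the scalar case, where the integral has a closed form. The following sketch is only an illustration under that scalar assumption; the function names and numerical values are hypothetical.

```python
import math


def noise_integral_exact(a, fc, dt):
    """Scalar value of the integral term in Equation (7.5-11), for a != 0."""
    return fc * fc * (math.exp(2.0 * a * dt) - 1.0) / (2.0 * a)


def noise_integral_approx(fc, dt):
    """Small-interval approximation of the integral term used in Equation (7.5-12)."""
    return dt * fc * fc


a, fc = -1.0, 1.0
err = {dt: abs(noise_integral_approx(fc, dt) - noise_integral_exact(a, fc, dt))
       for dt in (0.1, 0.05)}
ratio = err[0.1] / err[0.05]
print(err, ratio)   # halving dt cuts the error by a factor of roughly 4
```

The error in the integral term shrinks roughly as the square of the sample interval, so it is small compared with the term itself whenever $\Delta$ is short relative to the system time constants.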

The above discussion has concentrated on propagating the estimate between measurements, i.e., the time
update. It remains only to discuss the measurement update for the discrete measurements. We have $\tilde{x}(t_i)$ and
$Q(t_i)$ at some time point. We need to use these and the measured data at the time point to obtain $\hat{x}(t_i)$ and
$P(t_i)$. This is identical to the discrete-time measurement update problem solved by Equations (7.2-16)
and (7.2-17). We can also use the alternate forms discussed in Section 7.2.4.

To start the filter, we are given the a priori mean $\tilde{x}(t_0)$ and covariance $Q(t_0)$ of the state at time $t_0$.
Use Equations (7.2-16) and (7.2-17) (or alternates) to obtain $\hat{x}(t_0)$ and $P(t_0)$. Integrate Equations (7.5-2)
to (7.5-5) from $t_0$ to $t_1$ by some means (most likely Equations (7.5-10) and (7.5-12)) to obtain $\tilde{x}(t_1)$ and
$Q(t_1)$. This completes one time step of the filter; processing of subsequent time points uses the same
procedure.
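The alternation of measurement update and time update described above can be sketched for a scalar system as follows. This is a minimal illustration, assuming the scalar form of the gain in Equation (7.3-9) for the measurement update and Equations (7.5-10) and (7.5-12) for the time update. The function names, the noise-free measurements, and the numerical values are illustrative assumptions.

```python
import math


def measurement_update(x_pred, q, z, u, c, d, g):
    """Scalar measurement update: returns the filtered estimate and covariance."""
    k = q * c / (c * q * c + g * g)              # Kalman gain
    x_filt = x_pred + k * (z - c * x_pred - d * u)
    p = q - k * c * q
    return x_filt, p


def time_update(x_filt, p, u_now, u_next, a, b, fc, dt):
    """Scalar time update using Equations (7.5-10) and (7.5-12)."""
    phi = math.exp(a * dt)
    psi = (phi - 1.0) / a * b                    # valid for a != 0
    x_pred = phi * x_filt + 0.5 * psi * (u_now + u_next)
    q = phi * p * phi + dt * fc * fc
    return x_pred, q


a, b, c, d, fc, g, dt = -0.5, 1.0, 1.0, 0.0, 0.1, 0.1, 0.1
phi = math.exp(a * dt)
psi = (phi - 1.0) / a * b
u = 1.0                  # constant input, so the input model is exact here
x_true = 2.0             # true state at t_0, unknown to the filter
x_pred, q = 0.0, 4.0     # a priori mean and covariance at t_0

for _ in range(50):
    z = c * x_true + d * u                       # noise-free measurement, for illustration
    x_filt, p = measurement_update(x_pred, q, z, u, c, d, g)
    err = abs(x_true - x_filt)                   # error of the filtered estimate
    x_pred, q = time_update(x_filt, p, u, u, a, b, fc, dt)
    x_true = phi * x_true + psi * u              # exact propagation of the true state

print(err, p)   # the estimate error becomes negligible; p settles at a small positive value
```

Because the example feeds the filter noise-free data, the estimate error decays steadily; with real data the error would instead fluctuate at a level consistent with p.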

The solution for the steady-state form of the discrete/continuous filter follows immediately from that of
the discrete-time filter, because the equations for the covariance updates are identical for the two filters
with the appropriate substitution of F in terms of $F_c$. Theorem (7.3-1) therefore applies.

We can summarize this section by saying that there is a continuous/discrete-time filter derived from
appropriate results in the pure discrete- and pure continuous-time analyses. If the input u holds its value
between samples, then the form of the continuous/discrete filter is identical to that of the pure discrete-time
filter with an appropriate substitution for the equivalent discrete-time process noise covariance. For more
realistic behavior of u, we must adopt approximations if the analysis is done on a digital computer. It is
also possible to view the continuous-time filter equations as giving reasonable approximations to the
continuous/discrete-time filter in some situations. In any event, we will not go wrong as long as we recognize
that we can write the exact filter equations for the continuous/discrete-time system and that we must consider
any other equations used as approximations to the exact solution. With this frame of mind we can objectively
evaluate the adequacy of the approximations involved for specific problems.


7.6 SMOOTHING 

The derivation of optimal smoothers draws heavily on the derivation of the Kalman filter. Starting from
the filter results, only a single step is required to compute the smoothed estimates. In this section, we
briefly derive the fixed-interval smoother for discrete-time linear systems with additive Gaussian noise.
Fixed-interval smoothers are the most widely used. The same general principles apply to deriving fixed-point
and fixed-lag smoothers. See Meditch (1969) for derivations and equations for fixed-point and fixed-lag
smoothers and for continuous-time forms.

There are alternate computational forms for the fixed-interval smoother; these forms give mathematically
equivalent results. We will not discuss computational advantages of the various forms. See Bierman (1977)
and Bach and Wingrove (1983) for alternate forms and discussions of their advantages.

Consider the fixed-interval smoothing problem on an interval with N time points. As in the filter
derivation, we will concentrate on two time points at a time in order to get a recursive form. It is
straightforward to write an explicit formulation for the smoother, like the explicit filter form of Section 7.1, but
such a form is impractical.

In the nature of recursive derivations, assume that we have previously computed $\bar{x}_{i+1}$, the smoothed
estimate of $x_{i+1}$, and $S_{i+1}$, the covariance of $x_{i+1}$ given $Z_N$. We seek to derive an expression for
$\bar{x}_i$ and $S_i$. Note that this recursion runs backwards in time instead of forwards; a forward recursion will
not work, for reasons which we will see later.

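The smoothed estimates, $\bar{x}_i$ and $\bar{x}_{i+1}$, are defined by

$$ \begin{bmatrix} \bar{x}_i \\ \bar{x}_{i+1} \end{bmatrix} = E\left\{ \begin{bmatrix} x_i \\ x_{i+1} \end{bmatrix} \,\Big|\, Z_N \right\} \tag{7.6-1} $$

We will use the measurement partitioning ideas of Section 5.2.2, with the measurement $Z_N$ partitioned into
$Z_i$ and

$$ \tilde{Z}_i = (z_{i+1}, \ldots, z_N) \tag{7.6-2} $$

From the derivation of the Kalman filter, we can write the joint distribution of $x_i$ and $x_{i+1}$
conditioned on $Z_i$. It is Gaussian with

$$ E\left\{ \begin{bmatrix} x_i \\ x_{i+1} \end{bmatrix} \,\Big|\, Z_i \right\} = \begin{bmatrix} \hat{x}_i \\ \tilde{x}_{i+1} \end{bmatrix} \tag{7.6-3} $$

$$ \mathrm{cov}\left\{ \begin{bmatrix} x_i \\ x_{i+1} \end{bmatrix} \,\Big|\, Z_i \right\} = \begin{bmatrix} P_i & P_i\Phi^* \\ \Phi P_i & Q_{i+1} \end{bmatrix} \tag{7.6-4} $$

We did not previously derive the cross term in the above covariance matrix. To derive the form shown, write

$$ \begin{aligned} E\{(x_i - \hat{x}_i)(x_{i+1} - \tilde{x}_{i+1})^* \mid Z_i\} &= E\{(x_i - \hat{x}_i)(\Phi x_i + \Psi u_i + Fn_i - \Phi\hat{x}_i - \Psi u_i)^* \mid Z_i\} \\ &= E\{(x_i - \hat{x}_i)(x_i - \hat{x}_i)^* \mid Z_i\}\Phi^* + E\{(x_i - \hat{x}_i)(Fn_i)^* \mid Z_i\} \\ &= P_i\Phi^* + 0 \end{aligned} \tag{7.6-5} $$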

For the second step of the partitioned algorithm, we consider the measurements $\tilde{Z}_i$, using
Equations (7.6-3) and (7.6-4) for the prior distribution. The measurements $\tilde{Z}_i$ can be written in the form

$$ \tilde{Z}_i = \tilde{C}_i x_{i+1} + \tilde{D}_i + \tilde{G}_i\tilde{\eta}_i \tag{7.6-6} $$

for some matrices $\tilde{C}_i$, $\tilde{D}_i$, and $\tilde{G}_i$, and some Gaussian, zero-mean, identity-covariance noise
vector $\tilde{\eta}_i$. Although we could laboriously write out expressions for the matrices in Equation (7.6-6), this
step is unnecessary; we need only know that such a form exists. The important thing about Equation (7.6-6) is
that $x_i$ does not appear in it.

Using Equations (7.6-3) and (7.6-4) for the prior distribution and Equation (7.6-6) for the measurement
equation, we can now obtain the joint posterior distribution of $x_i$ and $x_{i+1}$ given $Z_N$. This distribution is
Gaussian with mean and covariance given by Equations (5.1-12) and (5.1-13), substituting Equation (7.6-3) for
$m_\xi$, Equation (7.6-4) for P, $\tilde{D}_i$ for D, $\tilde{G}_i$ for G, and

$$ C = [\,0 \quad \tilde{C}_i\,] \tag{7.6-7} $$


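By definition (Equation (7.6-1)), the mean of this distribution gives the smoothed estimates $\bar{x}_i$ and
$\bar{x}_{i+1}$. Making the substitutions into Equation (5.1-12) and expanding gives

$$ \begin{bmatrix} \bar{x}_i \\ \bar{x}_{i+1} \end{bmatrix} = \begin{bmatrix} \hat{x}_i \\ \tilde{x}_{i+1} \end{bmatrix} + \begin{bmatrix} P_i\Phi^* \\ Q_{i+1} \end{bmatrix} \tilde{C}_i^*(\tilde{C}_iQ_{i+1}\tilde{C}_i^* + \tilde{G}_i\tilde{G}_i^*)^{-1}(\tilde{Z}_i - \tilde{C}_i\tilde{x}_{i+1} - \tilde{D}_i) \tag{7.6-8} $$

We can solve Equation (7.6-8) for $\bar{x}_i$ in terms of $\bar{x}_{i+1}$, which we assume to have been computed in the
previous step of the backwards recursion.

$$ \bar{x}_i = \hat{x}_i + P_i\Phi^*Q_{i+1}^{-1}(\bar{x}_{i+1} - \tilde{x}_{i+1}) \tag{7.6-9} $$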


Equation (7.6-9) is the backwards recursive form sought. Note that the equation does not depend
explicitly on the measurements or on the matrices in Equation (7.6-6). That information is all subsumed in
$\bar{x}_{i+1}$. The "initial" condition for the recursion is

$$ \bar{x}_N = \hat{x}_N \tag{7.6-10} $$

which follows directly from the definitions. We do not have a corresponding known boundary condition at the
beginning of the interval, which is why we must propagate the smoothing recursion backwards, instead of
forwards.

We can now describe the complete process of computing the smoothed state estimates for a fixed time
interval. First propagate the Kalman filter through the entire interval, saving all of the values $\hat{x}_i$,
$\tilde{x}_i$, $P_i$, and $Q_i$. Then propagate Equation (7.6-9) backwards in time, using the saved values from the
filter, and starting from the boundary condition given by Equation (7.6-10).


We can derive a formula for the smoother covariance by substituting appropriately into Equation (5.1-13)
to get

$$ \begin{bmatrix} S_i & \cdot \\ \cdot & S_{i+1} \end{bmatrix} = \begin{bmatrix} P_i & P_i\Phi^* \\ \Phi P_i & Q_{i+1} \end{bmatrix} - \begin{bmatrix} P_i\Phi^* \\ Q_{i+1} \end{bmatrix} \tilde{C}_i^*(\tilde{C}_iQ_{i+1}\tilde{C}_i^* + \tilde{G}_i\tilde{G}_i^*)^{-1}\tilde{C}_i \begin{bmatrix} \Phi P_i & Q_{i+1} \end{bmatrix} \tag{7.6-11} $$

(The off-diagonal blocks are not relevant to this derivation.) We can solve Equation (7.6-11) for $S_i$ in
terms of $S_{i+1}$, giving

$$ S_i = P_i - P_i\Phi^*Q_{i+1}^{-1}(Q_{i+1} - S_{i+1})Q_{i+1}^{-1}\Phi P_i \tag{7.6-12} $$

This gives us a backwards recursion for the smoother covariance. The "initial" condition

$$ S_N = P_N \tag{7.6-13} $$

follows from the definitions. Note that, as in the recursion for the smoothed estimate, the measurements and
the measurement equation matrices have dropped out of Equation (7.6-12). All the necessary data about the
future process is subsumed in $S_{i+1}$. Note also that it is not necessary to compute the smoother covariance $S_i$
in order to compute the smoothed estimates.
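The forward filter pass followed by the backward recursions of Equations (7.6-9) and (7.6-12) can be sketched for a scalar system as follows. This is an illustrative sketch only: the function name, the zero-based indexing of the time points, and the sample data are assumptions, and the forward pass uses the scalar forms of the filter equations.

```python
def smooth_scalar(phi, psi, f, c, d, g, x0_pred, q0, z, u):
    """Fixed-interval smoother for a scalar system.

    Forward pass: Kalman filter, saving predicted and filtered values.
    Backward pass: Equations (7.6-9) and (7.6-12), started from the
    boundary conditions of Equations (7.6-10) and (7.6-13).
    """
    n = len(z)
    x_pred, q = [x0_pred], [q0]
    x_filt, p = [], []
    for i in range(n):
        k = q[i] * c / (c * q[i] * c + g * g)
        x_filt.append(x_pred[i] + k * (z[i] - c * x_pred[i] - d * u[i]))
        p.append(q[i] - k * c * q[i])
        if i + 1 < n:
            x_pred.append(phi * x_filt[i] + psi * u[i])
            q.append(phi * p[i] * phi + f * f)

    x_smooth, s = x_filt[:], p[:]           # last entries are the boundary conditions
    for i in range(n - 2, -1, -1):
        gain = p[i] * phi / q[i + 1]        # scalar form of P_i Phi* Q_{i+1}^{-1}
        x_smooth[i] = x_filt[i] + gain * (x_smooth[i + 1] - x_pred[i + 1])
        s[i] = p[i] - gain * (q[i + 1] - s[i + 1]) * gain
    return x_filt, p, x_smooth, s


z = [1.0, 0.8, 1.1, 0.9, 1.05]              # hypothetical measurements
u = [0.0] * len(z)
x_filt, p, x_smooth, s = smooth_scalar(phi=0.9, psi=0.0, f=0.3, c=1.0, d=0.0,
                                       g=0.5, x0_pred=0.0, q0=1.0, z=z, u=u)
print(p)
print(s)   # each smoothed variance is no larger than the filtered variance
```

At the last time point the smoothed and filtered results coincide; at earlier points the smoother variance is smaller because later measurements also contribute.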


7.7 NONLINEAR SYSTEMS AND NON-GAUSSIAN NOISE 


Optimal state estimation for nonlinear dynamic systems is substantially more difficult than for linear
systems. Only in rare special cases are there tractable exact solutions for optimal filters for nonlinear
systems. The same comments apply to systems with non-Gaussian noise.

Practical implementations of filters for nonlinear systems invariably involve approximations. The most
common approximations are based on linearizing the system and using the optimal filter for the linearized
system. Similarly, non-Gaussian noise is approximated, to first order, by Gaussian noise with the same mean
and covariance.


Consider a nonlinear dynamic system with additive noise

$$ \dot{x}(t) = f(x(t),u(t)) + n(t) \tag{7.7-1a} $$

$$ z(t_i) = g(x(t_i),u(t_i)) + \eta_i \tag{7.7-1b} $$

Assume that we have some nominal estimate, $x_n(t)$, of the state time history. Then the linearization of
Equation (7.7-1) about this nominal trajectory is

$$ \dot{x}(t) = A(t)x(t) + B(t)u(t) + f_n(t) + n(t) \tag{7.7-2a} $$

$$ z(t_i) = C(t_i)x(t_i) + D(t_i)u(t_i) + g_n(t_i) + \eta_i \tag{7.7-2b} $$

where

$$ A(t) = \nabla_x f(x_n(t),u(t)) \tag{7.7-3a} $$

$$ B(t) = \nabla_u f(x_n(t),u(t)) \tag{7.7-3b} $$

$$ C(t) = \nabla_x g(x_n(t),u(t)) \tag{7.7-3c} $$

$$ D(t) = \nabla_u g(x_n(t),u(t)) \tag{7.7-3d} $$

$$ f_n(t) = f(x_n(t),u(t)) \tag{7.7-4a} $$

$$ g_n(t) = g(x_n(t),u(t)) \tag{7.7-4b} $$

For a given nominal trajectory, Equations (7.7-2) to (7.7-4) define a time-varying linear system. The Kalman
filter/smoother algorithms derived in previous sections of this chapter give optimal state estimates for this
linearized system.
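When analytic gradients are inconvenient, the matrices of Equation (7.7-3) can be approximated by finite differences about the nominal trajectory. The sketch below does this for scalar state and input using central differences. The function name, the step size, and the example functions f and g are hypothetical, chosen only so that the result can be checked by hand.

```python
def linearize(f, g, x_nom, u, eps=1e-6):
    """Central-difference approximations to Equation (7.7-3) for scalar x and u."""
    a = (f(x_nom + eps, u) - f(x_nom - eps, u)) / (2.0 * eps)   # gradient of f in x
    b = (f(x_nom, u + eps) - f(x_nom, u - eps)) / (2.0 * eps)   # gradient of f in u
    c = (g(x_nom + eps, u) - g(x_nom - eps, u)) / (2.0 * eps)   # gradient of g in x
    d = (g(x_nom, u + eps) - g(x_nom, u - eps)) / (2.0 * eps)   # gradient of g in u
    return a, b, c, d


def f(x, u):
    """Hypothetical state function."""
    return -x ** 3 + u


def g(x, u):
    """Hypothetical measurement function."""
    return x ** 2


a, b, c, d = linearize(f, g, x_nom=2.0, u=0.5)
print(a, b, c, d)   # close to the analytic gradients -3*x**2, 1, 2*x, 0 at x = 2
```

The gradients must be re-evaluated at each time point along the nominal trajectory, which is what makes the linearized system time-varying.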

The filter based on this linearized system is called a linearized Kalman filter or an extended Kalman
filter (EKF). Its adequacy as an approximation to the optimal filter for the nonlinear system depends on
several factors which we will not analyze in depth. It is a reasonable supposition that if the system is
nearly linear, then the linearized Kalman filter will be a close approximation to the optimal filter for the
system. If, on the other hand, nonlinearities play a major role in defining the characteristic system
responses, the reasonableness of the linearized Kalman filter is questionable.

The above description is intended only to introduce the simplest concepts of linearized Kalman filters.
Starting from this point, there are numerous extensions, modifications, and nuances of application. Nonlinear
filtering is an area of current research. See Bach and Wingrove (1983) and Cox and Bryson (1980) for a few of
the many investigations in this field. Schweppe (1973) and Jazwinski (1970) have fairly extensive discussions
of nonlinear state estimation.


8.0 OUTPUT ERROR METHOD FOR DYNAMIC SYSTEMS

In previous chapters, we have covered the static estimation problem and the estimation of the state of
dynamic systems. With this background, we can now begin to address the principal subject of this book,
estimation of the parameters of dynamic systems.

Before addressing the more difficult parameter estimation problems posed by more general system models,
we will consider the simplified case that leads to the algorithm called output error. The simplification that
leads to the output-error method is to omit the process-noise term from the state equation. For this reason,
the output-error method is often described by terms like "the no-process-noise algorithm" or "the
measurement-noise-only algorithm."

We will first discuss mixed continuous/discrete-time systems, which are most appropriate for the majority
of the practical applications. We will follow this discussion by a brief summary of any differences for pure
discrete-time systems, which are useful for some applications. The derivation and results are essentially
identical. The pure continuous-time results, although similar in expression, involve extra complications. We
have never seen an appropriate practical application of the pure continuous-time results; we therefore feel
justified in omitting them.

In mixed continuous/discrete time, the most general system model that we will seriously consider is

$$ x(t_0) = x_0 \tag{8.0-1a} $$

$$ \dot{x}(t) = f[x(t),u(t),\xi] \tag{8.0-1b} $$

$$ z(t_i) = g[x(t_i),u(t_i),\xi] + G(\xi)\eta_i \qquad i = 1,2,\ldots \tag{8.0-1c} $$

The measurement noise $\eta$ is assumed to be a sequence of independent Gaussian random variables with zero mean
and identity covariance. The input u is assumed to be known exactly. The initial condition $x_0$ can be
treated in several ways, as discussed in Section 8.2. In general, the functions f and g can also be explicit
functions of t. We omit this from the notation for simplicity. (In any event, explicit time dependence can
be put in the notation of Equation (8.0-1) by defining an extra control equal to t.)


The corresponding nonlinear model for pure discrete-time systems is

$$ x(t_0) = x_0 \tag{8.0-2a} $$

$$ x(t_{i+1}) = f[x(t_i),u(t_i),\xi] \qquad i = 0,1,\ldots \tag{8.0-2b} $$

$$ z(t_i) = g[x(t_i),u(t_i),\xi] + G(\xi)\eta_i \qquad i = 1,2,\ldots \tag{8.0-2c} $$

The assumptions are the same as in the continuous/discrete case.


Although the output-error method applies to nonlinear systems, we will give special attention to the
treatment of linear systems. The linear form of Equation (8.0-1) is

$$ x(t_0) = x_0 \tag{8.0-3a} $$

$$ \dot{x}(t) = Ax(t) + Bu(t) \tag{8.0-3b} $$

$$ z(t_i) = Cx(t_i) + Du(t_i) + G\eta_i \qquad i = 1,2,\ldots \tag{8.0-3c} $$

The matrices A, B, C, D, and G are functions of $\xi$; we will not complicate the notation by explicitly
indicating this relationship. Of course, x and z are also functions of $\xi$ through their dependence on the
system matrices.

In general, the matrices A, B, C, D, and G can also be functions of time. For notational simplicity, we
have not explicitly indicated this dependence. In several places, time invariance of the matrices introduces
significant computational savings. The text will indicate such situations. Note that $\xi$ cannot be a function
of time. Problems with time-varying $\xi$ must be reformulated with a time-invariant $\xi$ in order for the
techniques of this chapter to be applicable.

The linear form of Equation (8.0-2) is

$$ x(t_0) = x_0 \tag{8.0-4a} $$

$$ x(t_{i+1}) = \Phi x(t_i) + \Psi u(t_i) \qquad i = 0,1,\ldots \tag{8.0-4b} $$

$$ z(t_i) = Cx(t_i) + Du(t_i) + G\eta_i \qquad i = 1,2,\ldots \tag{8.0-4c} $$

The transition matrices $\Phi$ and $\Psi$ are functions of $\xi$, and possibly of time.

For any of the model forms, a prior distribution for $\xi$ may or may not exist, depending on the particular
application. When there is no prior distribution, or when you desire to obtain an estimate independent of the
prior distribution, use a maximum-likelihood estimator. When a prior distribution is considered, MAP
estimators are appropriate. For the parameter estimation problem, a posteriori expected-value estimates and Bayesian
optimal estimates are impractical to compute, except in special cases. The posterior distribution of $\xi$ is
not, in general, symmetric; thus the a posteriori expected value need not equal the MAP estimate.


8.1 DERIVATION

The basic method of derivation for the output-error method is to reduce the problem to the static form of
Chapter 5. We will see that the dynamic system makes the models fairly complicated, but not different in any
essential way from those of Chapter 5. We first consider the case where G and the initial condition are
assumed to be known.


Choose an arbitrary value of $\xi$. Given the initial condition $x_0$ and a specific input time-history u,
the state equation (8.0-1b) can be solved to give the state as a function of time. We assume that f is
sufficiently smooth to guarantee the existence and uniqueness of the solution (Brauer and Nohel, 1969). For
complicated f functions, the solution may be difficult or impossible to express in closed form, but that
aspect is irrelevant to the theory. (The practical implication is that the solution will be obtained using
numerical approximation methods.) The important thing to note is that, because of the elimination of the
process noise, the solution is deterministic.


For a specified input u, the system state is thus a deterministic function of $\xi$ and time. For
consistency with the notation of the filter-error method discussed later, denote this function by $\tilde{x}_\xi(t)$. The $\xi$
subscript emphasizes the dependence on $\xi$. The dependence on u is not relevant to the current discussion,
so the notation ignores this dependence for simplicity. Assuming known G, Equation (8.0-1c) then becomes

$$ z(t_i) = g[\tilde{x}_\xi(t_i),u(t_i),\xi] + G\eta_i \qquad i = 1,2,\ldots \tag{8.1-1} $$

Equation (8.1-1) is in the form of Equation (5.4-1); it is a static nonlinear model with additive noise. There
are multiple experiments, one at each $t_i$. The estimators of Section 5.4 apply directly. The assumptions
adopted have allowed us to solve the system dynamics, leaving an essentially static problem.


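The MAP estimate is obtained by minimizing Equation (5.4-9). In the notation of this chapter, this
equation becomes

$$ J(\xi) = \frac{1}{2}\sum_{i=1}^{N} [z(t_i) - \tilde{z}_\xi(t_i)]^*(GG^*)^{-1}[z(t_i) - \tilde{z}_\xi(t_i)] + \frac{1}{2}(\xi - m_\xi)^*P^{-1}(\xi - m_\xi) \tag{8.1-2} $$

where

$$ \tilde{x}_\xi(t_0) = x_0 \tag{8.1-3a} $$

$$ \dot{\tilde{x}}_\xi(t) = f[\tilde{x}_\xi(t),u(t),\xi] \tag{8.1-3b} $$

$$ \tilde{z}_\xi(t_i) = g[\tilde{x}_\xi(t_i),u(t_i),\xi] \qquad i = 1,2,\ldots \tag{8.1-3c} $$

The quantities $m_\xi$ and P are the mean and covariance of the prior distribution of $\xi$, as in Chapter 5. For
the MLE estimator, omit the last term of Equation (8.1-2), giving

$$ J(\xi) = \frac{1}{2}\sum_{i=1}^{N} [z(t_i) - \tilde{z}_\xi(t_i)]^*(GG^*)^{-1}[z(t_i) - \tilde{z}_\xi(t_i)] \tag{8.1-4} $$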


Equation (8.1-4) is a quadratic form in the difference between z, the measured response (output), and
$\tilde{z}_\xi$, the response computed from the deterministic part of the system model. This motivates the name
"output error." The minimization of Equation (8.1-4) is an intuitively plausible estimator defensible even
without statistical derivation. The minimizing value of $\xi$ gives the system model that best approximates (in a
least-squares sense) the actual system response to the test input. Although this does not necessarily guarantee
that the model response and the system response will be similar for other test inputs, the minimizing value
of $\xi$ is certainly a plausible estimate.
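A small numerical illustration may make the output-error cost concrete. The sketch below evaluates the scalar form of Equation (8.1-4) for a first-order system with a single unknown parameter and locates the minimum by a crude grid search. The system, the pulse input, the noise-free data, the grid, and all names are illustrative assumptions; practical minimization algorithms are a separate subject.

```python
import math


def model_response(a, b, x0, u, dt):
    """Deterministic model outputs at t_1, ..., t_N for x' = a*x + b*u, z = x.

    The input is held constant over each sample interval, so the
    transition-matrix form is exact for this illustration.
    """
    phi = math.exp(a * dt)
    psi = (phi - 1.0) / a * b        # valid for a != 0
    x, out = x0, []
    for u_i in u:
        x = phi * x + psi * u_i
        out.append(x)
    return out


def output_error_cost(a, z, b, x0, u, dt, g):
    """Scalar form of Equation (8.1-4) with the unknown parameter xi = a."""
    z_model = model_response(a, b, x0, u, dt)
    return 0.5 * sum((zi - zm) ** 2 for zi, zm in zip(z, z_model)) / (g * g)


b, x0, dt, g = 1.0, 0.0, 0.1, 0.1
u = [1.0 if i < 20 else 0.0 for i in range(60)]     # pulse input
z = model_response(-1.0, b, x0, u, dt)              # noise-free "measurements", true a = -1

a_grid = [-k / 100.0 for k in range(10, 201)]       # candidate values -0.10 ... -2.00
a_hat = min(a_grid, key=lambda a: output_error_cost(a, z, b, x0, u, dt, g))
print(a_hat)   # -1.0 for this noise-free data
```

With measurement noise added to z, the minimum would shift away from the true value by an amount that depends on the noise level and on how strongly the input excites the system.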

The estimates that result from minimizing Equation (8.1-4) are sometimes called "least squares" estimates,
in reference to the quadratic form of the equation. We prefer to avoid the use of this terminology because it
is potentially confusing. Many of the estimators applicable to dynamic systems have a least-squares form, so
the term is not definitive. Furthermore, the term "least squares" is most often applied to Equation (8.1-4)
to contrast it from other forms labeled "maximum likelihood" (typically the estimators of Section 8.4, which
apply to unknown G, or the estimators of Chapter 9, which account for process noise). This contrast is
misleading because Equation (8.1-4) describes a completely rigorous, maximum-likelihood estimator for the problem
as posed. The differences between Equation (8.1-4) and the estimators of Section 8.4 and Chapter 9 are
differences in the problem statement, not differences in the statistical principles used for solution.

To derive the output-error method for pure discrete-time systems, substitute the discrete-time
Equation (8.0-2b) in place of Equation (8.0-1b). The derivation and the result are unchanged except that
Equation (8.1-3b) becomes

$$ \tilde{x}_\xi(t_{i+1}) = f[\tilde{x}_\xi(t_i),u(t_i),\xi] \qquad i = 0,1,\ldots $$


8.2 INITIAL CONDITIONS

The above derivation of the output-error method assumed that the initial condition was known exactly.
This assumption is seldom strictly true, except when using forms where the initial condition is zero by
definition.

The initial condition is typically based on imperfectly measured data. This characteristic suggests
treating the initial condition as a random variable with some mean and covariance. Such treatment, however, is
incompatible with the output-error method. The output-error method is predicated on a deterministic solution
of the state equation. Treatment of a random initial condition requires the more complex filter-error method
discussed later.

If the system is stable, then initial condition effects decay to a negligible level in a finite time.
If this decay is sufficiently fast and the error in the initial condition is sufficiently small, the initial
condition error will have negligible effect on the system response and can be ignored.

If the errors in the initial condition are too large to justify neglecting them, there are several ways to
resolve the problem without sacrificing the relative simplicity of the output-error method. One way is to
simply improve the initial-condition values. This is sometimes trivially easy if the initial-condition value
is computed from the measurement at the first time point of the maneuver (a common practice): change the start
time by one sample to avoid an obvious wild point, average the first few data points, or draw a fairing through
the noise and use the faired value.

When these methods are inapplicable or insufficient, we can include the initial condition in the list of
unknown parameters to estimate. The initial condition is then a deterministic function of $\xi$. The solution
of the state equation is thus still a deterministic function of $\xi$ and time, as required for the output-error
method. The equations of Section 8.1 still apply, provided that we substitute

$$\tilde{x}_\xi(t_0) = x_0(\xi) \tag{8.2-1}$$

for Equation (8.1-3a).

It is easy to show that the initial-condition estimates have poor asymptotic properties as the time
interval increases. The initial-condition information is all near the beginning of the maneuver, and
increasing the time interval does not add to this information. Asymptotically, we can and should ignore initial
conditions for stable systems. This is one case where asymptotic results are misleading. For real data with
finite time intervals we should always carefully consider initial conditions. Thus, we avoid making the
mistake of one published paper (which we will leave anonymous) which blithely set the model initial condition
to zero in spite of clearly nonzero data. It is not clear whether this was a simple oversight or whether the
author thought that asymptotic results justified the practice; in any event, the resulting errors were so
egregious as to render the results worthless (except as an object lesson).

8.3 COMPUTATIONS 

Equations (8.1-2) and (8.1-3) define the cost function that must be minimized to obtain the MAP estimates
(or, in the special case that $P^{-1}$ is zero, the MLE estimates). This is a fairly complicated function of $\xi$.
Therefore we must use an iterative minimization scheme.

It is easy to become overwhelmed by the apparent complexity of J as a function of $\xi$. $\tilde{z}_\xi(t_i)$ is itself
a complicated function of $\xi$, involving the solution of a differential equation. To get J as a function of
$\xi$ we must substitute this function for $\tilde{z}_\xi(t_i)$ in Equation (8.1-2). You might give up at the thought of
evaluating first and second gradients of this function, as required by most iterative optimization methods.

The complexity, however, is only apparent. It is crucial to recognize that we do not need to develop a 
closed-form expression, the development of which would be difficult at best. We are only required to develop 
a workable procedure for computing the result. 

To evaluate the gradients of J, we need only proceed one step at a time; each step is quite simple,
involving nothing more complicated than chain-rule differentiation. This step-by-step process follows the
advice from Alice in Wonderland:

The White Rabbit put on his spectacles. "Where shall I begin, please your
Majesty?" he asked.

"Begin at the beginning," the King said, very gravely, "and go on till you
come to the end; then stop."

8.3.1 Gauss-Newton Method 


The cost function is in the form of a sum of squares, which makes Gauss-Newton the preferred optimization
algorithm. Sections 2.5.2 and 5.4.3 discussed the Gauss-Newton algorithm. To gather together all the
important equations, we repeat the basic equations of the Gauss-Newton algorithm in the notation of this chapter.
Gauss-Newton is a quasi-Newton algorithm. The full Newton-Raphson algorithm is

$$\hat{\xi}_{L+1} = \hat{\xi}_L - [\nabla_\xi^2 J(\hat{\xi}_L)]^{-1} \nabla_\xi^* J(\hat{\xi}_L) \tag{8.3-1}$$

The first gradient is

$$\nabla_\xi J(\xi) = -\sum_{i=1}^{N} [z(t_i) - \tilde{z}_\xi(t_i)]^* (GG^*)^{-1} [\nabla_\xi \tilde{z}_\xi(t_i)] + (\xi - m_\xi)^* P^{-1} \tag{8.3-2}$$

For the Gauss-Newton algorithm, we approximate the second gradient by

$$\nabla_\xi^2 J(\xi) \approx \sum_{i=1}^{N} [\nabla_\xi \tilde{z}_\xi(t_i)]^* (GG^*)^{-1} [\nabla_\xi \tilde{z}_\xi(t_i)] + P^{-1} \tag{8.3-3}$$

which corresponds to Equation (2.5-11) applied to the cost function of this chapter. Equations (8.3-1)
through (8.3-3) are the same, whether the system is in pure discrete time or mixed continuous/discrete time.
The only quantities in these equations requiring any discussion are $\tilde{z}_\xi(t_i)$ and $\nabla_\xi \tilde{z}_\xi(t_i)$.
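
As a minimal sketch of how Equations (8.3-1) through (8.3-3) fit together, the following Python function
performs one Gauss-Newton update. The function name, the argument layout, and the use of NumPy are our
assumptions for illustration only; the residuals and the response gradients are taken as given, because their
computation is the subject of the following sections.

```python
import numpy as np

def gauss_newton_step(xi, residuals, sens, GG_inv, m_xi=None, P_inv=None):
    """One Gauss-Newton update of the parameter vector xi (hypothetical helper).

    residuals   : (N, m) array, row i is z(t_i) minus the model response
    sens        : (N, m, p) array, sens[i] is the response gradient at t_i
    GG_inv      : (m, m) inverse of the measurement noise covariance GG*
    m_xi, P_inv : optional prior mean and inverse prior covariance (MAP case)
    """
    p = xi.size
    grad = np.zeros(p)        # transpose of the first gradient, Eq. (8.3-2)
    hess = np.zeros((p, p))   # Gauss-Newton second gradient, Eq. (8.3-3)
    for r, S in zip(residuals, sens):
        grad -= S.T @ GG_inv @ r
        hess += S.T @ GG_inv @ S
    if P_inv is not None:
        grad += P_inv @ (xi - m_xi)
        hess += P_inv
    # Eq. (8.3-1): solve the linear system instead of forming the inverse
    return xi - np.linalg.solve(hess, grad)

# Illustrative check with made-up data: if the response is linear in xi,
# a single step from any starting point reaches the minimizing value.
rng = np.random.default_rng(0)
sens = rng.normal(size=(20, 2, 3))
xi_true = np.array([1.0, -2.0, 0.5])
z = sens @ xi_true
xi0 = np.zeros(3)
xi1 = gauss_newton_step(xi0, z - sens @ xi0, sens, np.eye(2))
```

For a dynamic system the response is generally a nonlinear function of the parameters, so the step is repeated,
recomputing the residuals and gradients at each new estimate.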

8.3.2 System Response 

The methods for computation of the system response depend on whether the system is pure discrete time
or mixed continuous/discrete time. The choice of method is also influenced by whether the system is linear
or nonlinear.


Computation of the response of discrete-time systems is simply a matter of plugging into the equations.
The general equations for a nonlinear system are

$$\tilde{x}_\xi(t_0) = x_0(\xi) \tag{8.3-4a}$$

$$\tilde{x}_\xi(t_{i+1}) = f[\tilde{x}_\xi(t_i), u(t_i), \xi] \qquad i = 0, 1, \ldots \tag{8.3-4b}$$

$$\tilde{z}_\xi(t_i) = g[\tilde{x}_\xi(t_i), u(t_i), \xi] \qquad i = 1, 2, \ldots \tag{8.3-4c}$$

The more specific equations for a linear discrete-time system are

$$\tilde{x}_\xi(t_0) = x_0(\xi) \tag{8.3-5a}$$

$$\tilde{x}_\xi(t_{i+1}) = \Phi \tilde{x}_\xi(t_i) + \Psi u(t_i) \qquad i = 0, 1, \ldots \tag{8.3-5b}$$

$$\tilde{z}_\xi(t_i) = C \tilde{x}_\xi(t_i) + D u(t_i) \qquad i = 1, 2, \ldots \tag{8.3-5c}$$
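
The linear discrete-time recursion of Equation (8.3-5) can be sketched in a few lines of Python. The function
name and the array layout (one row of u per time point, starting at the initial time) are assumptions made for
this illustration.

```python
import numpy as np

def simulate_discrete(Phi, Psi, C, D, x0, u):
    """Model response z(t_1)..z(t_N) of a linear discrete-time system.

    u is an (N+1, n_u) array holding u(t_0)..u(t_N).
    """
    x = np.asarray(x0, dtype=float)
    z = []
    for i in range(len(u) - 1):
        x = Phi @ x + Psi @ u[i]          # Eq. (8.3-5b): propagate to t_{i+1}
        z.append(C @ x + D @ u[i + 1])    # Eq. (8.3-5c): response at t_{i+1}
    return np.array(z)

# Made-up scalar example: x(t_{i+1}) = 0.5 x(t_i) + u(t_i), z = x, unit step input
Phi = np.array([[0.5]])
Psi = np.array([[1.0]])
C = np.array([[1.0]])
D = np.array([[0.0]])
u = np.ones((4, 1))
z = simulate_discrete(Phi, Psi, C, D, np.zeros(1), u)
```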


For mixed continuous/discrete-time systems, numerical methods for approximate integration are required.
You can use any of numerous numerical methods, but the utility of the more complicated methods is often
limited by the available data. It makes little sense to use a high-order method to integrate the system
equations between the time points where the input is measured. The errors implicit in interpolating the input
measurements are probably larger than the errors in the integration method. For most purposes, a second-order
Runge-Kutta algorithm is probably an appropriate choice:

$$\tilde{x}_\xi(t_0) = x_0(\xi) \tag{8.3-6a}$$

$$\bar{x}_\xi(t_{i+1}) = \tilde{x}_\xi(t_i) + (t_{i+1} - t_i) f[\tilde{x}_\xi(t_i), u(t_i), \xi] \tag{8.3-6b}$$

$$\tilde{x}_\xi(t_{i+1}) = \tilde{x}_\xi(t_i) + (t_{i+1} - t_i) \tfrac{1}{2} \{ f[\tilde{x}_\xi(t_i), u(t_i), \xi] + f[\bar{x}_\xi(t_{i+1}), u(t_{i+1}), \xi] \} \tag{8.3-6c}$$

$$\tilde{z}_\xi(t_i) = g[\tilde{x}_\xi(t_i), u(t_i), \xi] \tag{8.3-6d}$$
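
A second-order Runge-Kutta integration of this kind (an Euler predictor followed by a trapezoidal corrector)
might be coded as follows. This is only a sketch: the function name, the calling convention for f and g, and
the first-order lag used as an example are our assumptions.

```python
import numpy as np

def rk2_response(f, g, x0, u, t, xi):
    """Integrate dx/dt = f(x, u, xi) between sample times and return z(t_1)..z(t_N).

    u and t hold the input and the time at t_0..t_N.
    """
    x = np.asarray(x0, dtype=float)
    z = []
    for i in range(len(t) - 1):
        h = t[i + 1] - t[i]
        k1 = f(x, u[i], xi)
        x_pred = x + h * k1               # predictor step
        k2 = f(x_pred, u[i + 1], xi)
        x = x + 0.5 * h * (k1 + k2)       # corrector step
        z.append(g(x, u[i + 1], xi))      # response at the new time point
    return np.array(z)

# Hypothetical first-order lag: dx/dt = -xi[0] * x + u, z = x, zero input
f = lambda x, u, xi: -xi[0] * x + u
g = lambda x, u, xi: x
t = np.linspace(0.0, 1.0, 101)
u = np.zeros((101, 1))
z = rk2_response(f, g, np.array([1.0]), u, t, np.array([1.0]))
```

With the step size of 0.01 used here, the final value agrees with exp(-1) to within about 1e-5, which is
typically far below the error contributed by interpolating measured inputs.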

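For linear systems, a transition matrix method is more accurate and efficient than Equation (8.3-6).

$$\tilde{x}_\xi(t_0) = x_0(\xi) \tag{8.3-7a}$$

$$\tilde{x}_\xi(t_{i+1}) = \Phi \tilde{x}_\xi(t_i) + \Psi \tfrac{1}{2} [u(t_i) + u(t_{i+1})] \qquad i = 0, 1, \ldots \tag{8.3-7b}$$

$$\tilde{z}_\xi(t_i) = C \tilde{x}_\xi(t_i) + D u(t_i) \qquad i = 1, 2, \ldots \tag{8.3-7c}$$

where

$$\Phi = \exp[A(t_{i+1} - t_i)] \tag{8.3-8}$$

$$\Psi = \int_0^{t_{i+1} - t_i} \exp(A\tau)\, d\tau \, B \tag{8.3-9}$$

Section 7.5 discusses the form of Equation (8.3-7b). Moler and Van Loan (1978) describe several ways of
numerically evaluating Equations (8.3-8) and (8.3-9). In this application, because $t_{i+1} - t_i$ is small
compared to the system natural periods, simple series expansion works well.

$$\Phi \approx I + A\Delta + \frac{(A\Delta)^2}{2!} + \frac{(A\Delta)^3}{3!} + \cdots \tag{8.3-10}$$

$$\Psi \approx \Delta \left[ I + \frac{A\Delta}{2!} + \frac{(A\Delta)^2}{3!} + \cdots \right] B \tag{8.3-11}$$

$$\Delta = t_{i+1} - t_i \tag{8.3-12}$$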
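
The series of Equations (8.3-10) and (8.3-11) share their terms, so both matrices can be accumulated in one
loop. The sketch below assumes a fixed number of terms chosen by the caller; the function name and the default
of ten terms are illustrative, not a recommendation from the text.

```python
import numpy as np

def phi_psi_series(A, B, dt, n_terms=10):
    """Truncated series for Phi = exp(A dt) and Psi = (integral of exp(A tau)) B."""
    n = A.shape[0]
    term = np.eye(n)      # holds (A dt)^k / k!
    Phi = np.eye(n)
    S = np.eye(n)         # accumulates sum of (A dt)^k / (k+1)!
    for k in range(1, n_terms):
        term = term @ (A * dt) / k
        Phi = Phi + term
        S = S + term / (k + 1)
    Psi = dt * S @ B
    return Phi, Psi

# Double integrator (made-up example); the series terminates, so the result is exact
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Phi, Psi = phi_psi_series(A, B, 0.1)
```

The truncation is adequate only when the sample interval is short relative to the system dynamics, which is
the situation described above.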


8.3.3 Finite Difference Response Gradient 


It remains to discuss the computation of $\nabla_\xi \tilde{z}_\xi(t_i)$, the gradient of the system response. There are two
basic methods for evaluating this gradient: finite-difference differentiation and analytic differentiation.
This section discusses the finite-difference approach, and the next section discusses the analytic approach.

Finite-difference differentiation is applicable to any model form. The method is easy to describe and
equally easy to code. Because it is easy to code, finite-difference differentiation is appropriate for
programs where quick results are needed or the production workload is small enough that saving program
development time is more important than improving program efficiency. Because it applies with equal ease to all model
forms, finite-difference differentiation is also appropriate for programs that must handle nonlinear models,
for which analytic differentiation is numerically complicated (Jategaonkar and Plaetschke, 1983).

To use finite-difference differentiation, perturb the first element of the $\xi$ vector by some small amount
$\Delta\xi^{(1)}$. Recompute the system response using this perturbed $\xi$ vector, obtaining the perturbed system response
$\tilde{z}_p$. The partial derivative of the response with respect to $\xi^{(1)}$ is then approximately

$$\frac{\partial \tilde{z}_\xi(t_i)}{\partial \xi^{(1)}} \approx \frac{\tilde{z}_p(t_i) - \tilde{z}_\xi(t_i)}{\Delta\xi^{(1)}} \tag{8.3-13}$$

Repeat this process, perturbing each element of $\xi$ in turn, to approximate the partial derivatives with
respect to each element of $\xi$. The finite-difference gradient is then the concatenation of the partial
derivatives.

$$\nabla_\xi \tilde{z}_\xi(t_i) \approx \left[ \frac{\partial \tilde{z}_\xi(t_i)}{\partial \xi^{(1)}} \; \cdots \; \frac{\partial \tilde{z}_\xi(t_i)}{\partial \xi^{(p)}} \right] \tag{8.3-14}$$
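
The procedure just described translates almost directly into code. In the sketch below, `response` stands for
any routine that returns the model response at all time points for a given parameter vector; that routine, the
function name, and the exponential example are hypothetical.

```python
import numpy as np

def fd_response_gradient(response, xi, deltas):
    """Forward-difference gradient of the response with respect to each element of xi.

    response(xi) returns an (N, m) array; the result has shape (N, m, p).
    """
    xi = np.asarray(xi, dtype=float)
    z0 = response(xi)                       # unperturbed response
    grad = np.empty(z0.shape + (xi.size,))
    for j in range(xi.size):
        xp = xi.copy()
        xp[j] += deltas[j]                  # perturb one element at a time
        grad[..., j] = (response(xp) - z0) / deltas[j]
    return grad

# Made-up response: z(t) = xi[0] * exp(-xi[1] * t) at t = 1, 2, 3
t = np.arange(1, 4)

def response(xi):
    return (xi[0] * np.exp(-xi[1] * t))[:, None]

xi = np.array([2.0, 0.5])
grad = fd_response_gradient(response, xi, 1e-6 * np.ones(2))
```

Each call costs one response computation per parameter plus one for the unperturbed case, which is the price
paid for the generality of the method.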


Selection of the size of the perturbations requires some thought. If the perturbation is too large,
Equation (8.3-13) becomes a poor approximation of the partial derivative. If the perturbation is too small,
roundoff errors become a problem.

Some people have reported excellent results using simple perturbation-size rules such as setting the
perturbation magnitude at 1% of a typical expected magnitude of the corresponding $\xi$ element (assuming that
you understand the problem well enough to be able to establish such typical magnitudes). You could
alternatively consider percentages of the current iteration estimates (with some special provision for handling zero
or essentially zero estimates). Another reasonable rule, after the first iteration, would be to use
percentages of the diagonal elements of the second gradient, raised to the -1/2 power. As a final resort (it takes
more computer time and is more complex), you could try several perturbation sizes, using the results to gauge
the degree of nonlinearity and roundoff error, and adaptively selecting the best perturbation size.

Due to our limited experience with the finite-difference approach, we defer making specific
recommendations on perturbation sizes, but offer the opinion that the problem is amenable to reasonable solution. A
little experimentation should suffice to establish an adequate perturbation-size rule for a specific class of
problems. Note that the higher the precision of your computer, the more margin you have between the boundaries
of linearity problems and roundoff problems. Those of us with 60- and 64-bit computers (or 32-bit computers
in double precision) seldom have serious roundoff problems and can use simple perturbation-size rules with
impunity. If you try to get by with single precision on a 32-bit computer, careful perturbation-size selection
will be more important.

8.3.4 Analytic Response Gradient 

The other approach to computing the gradient of the system response is to analytically differentiate the
system equations. For linear systems, this approach is sometimes far more efficient than finite-difference
differentiation. For nonlinear systems, analytic differentiation is impractically clumsy (partially because
you have to redo it for each new nonlinear model form). We will, therefore, restrict our discussion of
analytic differentiation to linear systems.

We first consider pure discrete-time linear systems in the form of Equation (8.3-5). It is crucial to
recall that we do not need a closed form for the gradient; we only need a method for computing it. A
closed-form expression would be formidable, unlike the following equation, which is the almost embarrassingly obvious
gradient of Equation (8.3-5), obtained by using nothing more complicated than the chain rule:

$$\nabla_\xi \tilde{x}_\xi(t_0) = \nabla_\xi x_0(\xi) \tag{8.3-13a}$$

$$\nabla_\xi \tilde{x}_\xi(t_{i+1}) = \Phi (\nabla_\xi \tilde{x}_\xi(t_i)) + (\nabla_\xi \Phi) \tilde{x}_\xi(t_i) + (\nabla_\xi \Psi) u(t_i) \qquad i = 0, 1, \ldots \tag{8.3-13b}$$

$$\nabla_\xi \tilde{z}_\xi(t_i) = C (\nabla_\xi \tilde{x}_\xi(t_i)) + (\nabla_\xi C) \tilde{x}_\xi(t_i) + (\nabla_\xi D) u(t_i) \qquad i = 1, 2, \ldots \tag{8.3-13c}$$

Equation (8.3-13b) gives a recursive formula for $\nabla_\xi \tilde{x}_\xi(t_i)$, with Equation (8.3-13a) as the initial condition.
Equation (8.3-13c) expresses $\nabla_\xi \tilde{z}_\xi(t_i)$ in terms of the solution of Equation (8.3-13b).

The quantities $\nabla_\xi \Phi$, $\nabla_\xi \Psi$, $\nabla_\xi C$, and $\nabla_\xi D$ in Equation (8.3-13) are gradients of matrices with respect to
the vector $\xi$. The results are vectors, the elements of which are matrices (if you are fond of buzz words,
these are third-order tensors). If this starts to sound complicated, you will be pleased to know that the
products like $(\nabla_\xi D) u(t_i)$ are ordinary matrices (and indeed sparse matrices; they have lots of zero elements).
You can compute the products directly without ever forming the vector of matrices in your program. A program
to implement Equation (8.3-13) takes fewer lines than the explanation.

We could write Equation (8.3-13) without using gradients or matrices. Simply replace $\nabla_\xi$ by $\partial/\partial\xi^{(j)}$
throughout, and then concatenate the partial derivatives to get the gradient of $\tilde{z}_\xi(t_i)$. We then have, at
worst, partial derivatives of matrices with respect to scalars; these partial derivatives are matrices. The
only difference between writing the equations with partial derivatives or gradients is notational. We choose
to use the gradient notation because it is shorter and more consistent with the rest of the book.


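Let us look at Equation (8.3-13c) in detail to see how these equations would be implemented in a program,
and perhaps to better understand the equations. The left-hand side is a matrix. Each column of the matrix is
the partial derivative of $\tilde{z}_\xi(t_i)$ with respect to one element of $\xi$:

$$\nabla_\xi \tilde{z}_\xi(t_i) = \left[ \frac{\partial \tilde{z}_\xi(t_i)}{\partial \xi^{(1)}} \;\; \frac{\partial \tilde{z}_\xi(t_i)}{\partial \xi^{(2)}} \;\; \cdots \;\; \frac{\partial \tilde{z}_\xi(t_i)}{\partial \xi^{(p)}} \right] \tag{8.3-14}$$

The quantity $\nabla_\xi \tilde{x}_\xi(t_i)$ is a similar matrix, computed from Equation (8.3-13b); thus $C(\nabla_\xi \tilde{x}_\xi(t_i))$ is a
multiplication of a matrix times a matrix, and this is a calculation we can handle. The quantity $\nabla_\xi C$ is the
vector of matrices

$$\nabla_\xi C = \left[ \frac{\partial C}{\partial \xi^{(1)}} \;\; \frac{\partial C}{\partial \xi^{(2)}} \;\; \cdots \;\; \frac{\partial C}{\partial \xi^{(p)}} \right] \tag{8.3-15}$$

and the product $(\nabla_\xi C) \tilde{x}_\xi(t_i)$ is

$$(\nabla_\xi C) \tilde{x}_\xi(t_i) = \left[ \frac{\partial C}{\partial \xi^{(1)}} \tilde{x}_\xi(t_i) \;\; \frac{\partial C}{\partial \xi^{(2)}} \tilde{x}_\xi(t_i) \;\; \cdots \;\; \frac{\partial C}{\partial \xi^{(p)}} \tilde{x}_\xi(t_i) \right] \tag{8.3-16}$$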


(Our notation does not indicate explicitly that this is the intended product formula, but the other conceivable
interpretation of the notation is obviously wrong because the dimensions are incompatible. Formal tensor
notation would make the intention explicit, but we do not really need to introduce tensor notation here because
the correct interpretation is obvious.)

In many cases the matrix $\partial C/\partial\xi^{(j)}$ will be sparse. Typically these matrices are either zero or have only
one nonzero element. We can take advantage of such sparseness in the computation. If C is not a function of
$\xi^{(j)}$ (presumably $\xi^{(j)}$ affects other system matrices), then $\partial C/\partial\xi^{(j)}$ is a zero matrix. If only the
(k,m) element of C is affected by $\xi^{(j)}$, then $[\partial C/\partial\xi^{(j)}] \tilde{x}_\xi(t_i)$ is a vector with
$[\partial C^{(k,m)}/\partial\xi^{(j)}] \tilde{x}_\xi(t_i)^{(m)}$ in the kth element and zeros elsewhere. If more than one element of C is
affected by $\xi^{(j)}$, then the result is a sum of such terms. This approach directly forms
$[\partial C/\partial\xi^{(j)}] \tilde{x}_\xi(t_i)$, taking advantage of sparseness, instead of forming the full $\partial C/\partial\xi^{(j)}$ matrix and using a
general-purpose matrix multiply routine. The terms $(\nabla_\xi D) u(t_i)$, $(\nabla_\xi \Phi) \tilde{x}_\xi(t_i)$, and $(\nabla_\xi \Psi) u(t_i)$ are all
similar in form to $(\nabla_\xi C) \tilde{x}_\xi(t_i)$. The initial condition $\nabla_\xi x_0$ is a zero matrix if $x_0$ is known; otherwise
it has a nonzero element for each unknown element of $x_0$.


We now know how to evaluate all of the terms in Equation (8.3-13). This is significantly faster than
finite differences for some applications. The speed-up is most significant if $\Phi$, $\Psi$, C, and D are functions
of time requiring significant work to evaluate at each point; straightforward finite-difference methods would
have to reevaluate these matrices for each perturbation.
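
To make the recursion of Equation (8.3-13) concrete, the following Python sketch propagates the state, the
response, and their gradients together. For clarity it stores each partial-derivative matrix in full and uses
general matrix products, so it does not exploit the sparseness discussed above; the function name and the
argument layout are assumptions made for this example.

```python
import numpy as np

def response_and_gradient(Phi, Psi, C, D, dPhi, dPsi, dC, dD, x0, dx0, u):
    """Response z(t_1)..z(t_N) and its gradient for a linear discrete-time system.

    dPhi, dPsi, dC, dD are lists of length p; entry j is the partial derivative
    of the matrix with respect to parameter j. dx0 is the (n, p) gradient of
    the initial condition. u holds u(t_0)..u(t_N).
    """
    p = len(dPhi)
    x = np.asarray(x0, dtype=float)
    gx = np.array(dx0, dtype=float)                 # Eq. (8.3-13a)
    z, gz = [], []
    for i in range(len(u) - 1):
        gx_next = Phi @ gx                          # Eq. (8.3-13b), first term
        for j in range(p):
            gx_next[:, j] += dPhi[j] @ x + dPsi[j] @ u[i]
        x = Phi @ x + Psi @ u[i]                    # Eq. (8.3-5b)
        gx = gx_next
        g = C @ gx                                  # Eq. (8.3-13c), first term
        for j in range(p):
            g[:, j] += dC[j] @ x + dD[j] @ u[i + 1]
        z.append(C @ x + D @ u[i + 1])
        gz.append(g)
    return np.array(z), np.array(gz)

# Made-up scalar system x(t_{i+1}) = a x(t_i) + b u(t_i), z = x, parameters (a, b)
a, b = 0.5, 2.0
one, zero = np.array([[1.0]]), np.array([[0.0]])
z, gz = response_and_gradient(
    np.array([[a]]), np.array([[b]]), one, zero,
    [one, zero], [zero, one], [zero, zero], [zero, zero],
    np.array([1.0]), np.zeros((1, 2)), np.ones((3, 1)))
```

A finite-difference gradient of the same response is a convenient independent check on code of this kind.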

Gupta and Mehra (1974) discuss a method that is basically a modification of Equation (8.3-13) for
computing $\nabla_\xi \tilde{z}_\xi(t_i)$. Depending on the number of inputs, states, outputs, and unknown parameters, this method can
sometimes save computer time by reducing the length of the gradient vector needed for propagation in
Equation (8.3-13).


We now have everything needed to implement the basic Gauss-Newton minimization algorithm. Practical
application will typically require some kind of start-up algorithm and methods for handling cases where the
algorithm converges slowly or diverges. The Iliff-Maine code, MMLE3 (Maine and Iliff, 1980; and Maine, 1981),
incorporates several such modifications. The line-search ideas (Foster, 1983) briefly discussed at the end of
Section 2.5.2 also seem appropriate for handling convergence problems. We will not cover the details of such
practical issues here.


The discussions of singularities in Section 5.4.4 and of partitioning in Section 5.4.5 apply directly to 
the problem of this chapter, so we will not repeat them. 


8.4 UNKNOWN G 

The previous discussion in this chapter has assumed that the G-matrix is known. Equations (8.1-2)
and (8.1-4) are derived based on this assumption. For unknown G, the methods of Section 5.5 apply directly.
Equation (5.5-2) substitutes for Equation (8.1-4). In the terminology of this chapter, Equation (5.5-2)
becomes

$$J(\xi) = \frac{1}{2} \sum_{i=1}^{N} \left\{ [z(t_i) - \tilde{z}_\xi(t_i)]^* [G(\xi)G(\xi)^*]^{-1} [z(t_i) - \tilde{z}_\xi(t_i)] + \ln|G(\xi)G(\xi)^*| \right\} \tag{8.4-1}$$

If G is known, this reduces to Equation (8.1-4) plus a constant.

As discussed in Section 5.5, the best approach to minimizing Equation (8.4-1) is to partition the
parameter vector into a part $\xi_G$ affecting G, and a part $\xi_f$ affecting $\tilde{z}$. For each fixed G, the Gauss-Newton
equations of Section 8.3 apply to revising the estimate of $\xi_f$. For each fixed $\xi_f$, the revised estimate of G
is given by Equation (5.5-7), which becomes

$$\hat{G}\hat{G}^* = \frac{1}{N} \sum_{i=1}^{N} [z(t_i) - \tilde{z}_\xi(t_i)] [z(t_i) - \tilde{z}_\xi(t_i)]^* \tag{8.4-2}$$

in the current notation. Section 5.5 describes the axial iteration method, which alternately applies the
Gauss-Newton equations of Section 8.3 for $\xi_f$ and Equation (8.4-2) for G.
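
The G half of the axial iteration is a one-line computation once the residuals are available. The sketch below
assumes the residuals are stored one time point per row; the function name and the optional diagonal
restriction are our additions for illustration.

```python
import numpy as np

def estimate_GG(residuals, diagonal=False):
    """Estimate of GG* from the output-error residuals, following Eq. (8.4-2).

    residuals : (N, m) array, row i is z(t_i) minus the model response.
    diagonal  : if True, keep only the diagonal elements (assumes G is
                restricted to be diagonal).
    """
    r = np.asarray(residuals, dtype=float)
    GG = r.T @ r / r.shape[0]     # average of the residual outer products
    if diagonal:
        GG = np.diag(np.diag(GG))
    return GG

# Made-up residuals for two measured signals
resid = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 0.0], [0.0, 2.0]])
GG_hat = estimate_GG(resid)
```

In an axial iteration, this estimate would be recomputed after each revision of the remaining parameters, and
its inverse used as the weighting in the next Gauss-Newton step.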

The cost function for estimation with unknown G is often written in alternate forms. Although the above
form is usually the most useful for computation, the following forms provide some insight into the relations of
the estimators with unknown G versus those with fixed G. When G is completely unknown, the minimization
of Equation (8.4-1) is equivalent to the minimization of

$$J(\xi) = \left| \sum_{i=1}^{N} [z(t_i) - \tilde{z}_\xi(t_i)] [z(t_i) - \tilde{z}_\xi(t_i)]^* \right| \tag{8.4-3}$$

which corresponds to Equation (5.5-9). Section 5.5 derives this equivalence by eliminating G. It is common
to restrict G to be diagonal, in which case Equation (8.4-3) becomes

$$J(\xi) = \prod_{j=1}^{\ell} \sum_{i=1}^{N} [z^{(j)}(t_i) - \tilde{z}_\xi^{(j)}(t_i)]^2 \tag{8.4-4}$$

This form is a product of the errors in the different signals, instead of the weighted sum-of-the-errors form
of Equation (8.1-4).


8.5 CHARACTERISTICS 


We have shown that the output error estimator is a direct application of the estimators derived in 
Section 5.4 for nonlinear static systems. To describe the statistical characteristics of output error esti- 
mates, we need only apply the corresponding Section 5.4 results to the particular form of output error. 

In most cases, the corresponding static system is nonlinear, even for linear dynamic systems. Therefore, 
we must use the forms of Section 5.4 instead of the simpler forms of Section 5.1, which apply to linear static 
systems. In particular, the output error MLE and MAP estimators are both biased for finite time. Asymptoti- 
cally, they are unbiased and efficient. 

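From Equation (5.4-11), the covariance of the MLE output error estimate is approximated by

$$\mathrm{cov}(\hat{\xi} \mid \xi) \approx \left\{ \sum_{i=1}^{N} [\nabla_\xi \tilde{z}_\xi(t_i)]^* (GG^*)^{-1} [\nabla_\xi \tilde{z}_\xi(t_i)] \right\}^{-1} \tag{8.5-1}$$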
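
The covariance approximation is the inverse of the same matrix that Gauss-Newton uses as the second gradient,
so it is available at essentially no extra cost once the iteration has converged. A minimal sketch follows; the
function name is hypothetical, and the optional prior term is included so the same routine also covers the MAP
form, in which the inverse prior covariance is added before inverting.

```python
import numpy as np

def estimate_covariance(sens, GG_inv, P_inv=None):
    """Approximate covariance of the output-error estimates.

    sens   : (N, m, p) array of response gradients at the time points
    GG_inv : (m, m) inverse measurement noise covariance
    P_inv  : optional (p, p) inverse prior covariance (MAP case)
    """
    p = sens.shape[2]
    info = np.zeros((p, p))
    for S in sens:
        info += S.T @ GG_inv @ S      # accumulate the information matrix
    if P_inv is not None:
        info += P_inv
    return np.linalg.inv(info)

# Made-up case: four time points, each measuring both parameters directly
sens = np.stack([np.eye(2)] * 4)
cov_mle = estimate_covariance(sens, np.eye(2))
cov_map = estimate_covariance(sens, np.eye(2), P_inv=4.0 * np.eye(2))
```

The square roots of the diagonal elements give approximate standard deviations of the individual parameter
estimates.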


From Equation (5.4-12), the corresponding approximation for the posterior distribution of $\xi$ in an MAP
estimator is

$$\mathrm{cov}(\xi \mid Z) \approx \left\{ \sum_{i=1}^{N} [\nabla_\xi \tilde{z}_\xi(t_i)]^* (GG^*)^{-1} [\nabla_\xi \tilde{z}_\xi(t_i)] + P^{-1} \right\}^{-1} \tag{8.5-2}$$


CHAPTER 9


9.0 FILTER ERROR METHOD FOR DYNAMIC SYSTEMS

In this chapter, we consider the parameter estimation problem for dynamic systems with both process and
measurement noise. We restrict the consideration to linear systems with additive Gaussian noise, because the
exact analysis of more general systems is impractically complicated except in special cases like output error
(no process noise).

The easiest way to handle nonlinear systems with both measurement and process noise is usually to
linearize the system and apply the linear results. This method does not give exact results for nonlinear systems,
but can give adequate approximations in some cases.

In mixed continuous/discrete time, the linear system model is

$$x(t_0) = x_0 \tag{9.0-1a}$$

$$\dot{x}(t) = A x(t) + B u(t) + F n(t) \tag{9.0-1b}$$

$$z(t_i) = C x(t_i) + D u(t_i) + G \eta_i \qquad i = 1, 2, \ldots \tag{9.0-1c}$$

The measurement noise $\eta$ is assumed to be a sequence of independent Gaussian random variables with zero mean
and identity covariance. The process noise $n$ is a zero-mean, white-noise process, independent of the
measurement noise, with identity spectral density. The initial condition $x_0$ is assumed to be a Gaussian
random variable, independent of $n$ and $\eta$, with mean $\bar{x}_0$ and covariance $P_0$. As special cases, $P_0$ can be 0,
implying that the initial condition is known exactly; or infinite, implying complete ignorance of the initial
condition. The input u is assumed to be known exactly.

As in the case of output error, the system matrices A, B, C, D, F, and G are functions of $\xi$ and may
be functions of time.

The corresponding pure discrete-time model is

$$x(t_0) = x_0 \tag{9.0-2a}$$

$$x(t_{i+1}) = \Phi x(t_i) + \Psi u(t_i) + F n_i \qquad i = 0, 1, \ldots \tag{9.0-2b}$$

$$z(t_i) = C x(t_i) + D u(t_i) + G \eta_i \qquad i = 1, 2, \ldots \tag{9.0-2c}$$

All of the same assumptions apply, except that $n$ is a sequence of independent Gaussian random variables with
zero mean and identity covariance.

9.1 DERIVATION 

In order to obtain the maximum likelihood estimate of $\xi$, we need to choose $\xi$ to maximize
$L(\xi, Z_N) = p(Z_N \mid \xi)$ where

$$Z_N = [z(t_1), z(t_2), \ldots, z(t_N)]^* \tag{9.1-1}$$

For the MAP estimate, we need to maximize $p(Z_N \mid \xi)\,p(\xi)$. In either event, the crucial first step is to find a
tractable expression for $p(Z_N \mid \xi)$. We will discuss three ways of deriving this density function.

9.1.1 Static Derivation 


The first means of deriving an expression for $p(Z_N \mid \xi)$ is to solve the system equations, reducing them to
the static form of Equation (5.0-1). This technique, although simple in principle, does not give a tractable
solution. We briefly outline the approach here in order to illustrate the principle, before considering the
more fruitful approaches of the following sections.

For a pure discrete-time linear system described by Equation (9.0-2), the explicit static expression for
$z(t_i)$ is

$$z(t_i) = C \Phi^i x_0 + C \sum_{j=0}^{i-1} \Phi^{i-j-1} (\Psi u(t_j) + F n_j) + D u(t_i) + G \eta_i \tag{9.1-2}$$

This is a nonlinear static model in the general form of Equation (5.5-1). However, the separation of $\xi$
into $\xi_G$ and $\xi_f$ as described by Equation (5.5-4) does not apply. Note that Equation (9.1-2) is a nonlinear
function of $\xi$, even if the matrices are linear functions. In fact, the order of nonlinearity increases with
the number of time points. The use of estimators derived directly from Equation (9.1-2) is unacceptably
difficult for all but the simplest special cases, and we will not pursue it further.

For mixed continuous/discrete-time systems, similar principles apply, except that the $\omega$ of
Equation (5.0-1) must be generalized to allow vectors of infinite dimension. The process noise in a mixed
continuous/discrete-time system is a function of time, and cannot be written as a finite-dimensional random
vector. The material of Chapter 5 covered only finite-dimensional vectors. The Chapter 5 results generalize
nicely to infinite-dimensional vector spaces (function spaces), but we will not find that level of abstraction
necessary. Application to pure continuous-time systems would require further generalization to allow
infinite-dimensional observations.

9.1.2 Derivation by Recursive Factoring 

We will now consider a derivation based on factoring $p(Z_N \mid \xi)$ by means of Bayes rule (Equation (3.3-12)).
The derivation applies either to pure discrete-time or mixed continuous/discrete-time systems; the derivation
is identical in both cases. For the first step, write

$$p(Z_N \mid \xi) = p(z(t_N) \mid Z_{N-1}, \xi)\, p(Z_{N-1} \mid \xi) \tag{9.1-3}$$

Recursive application of this formula gives

$$p(Z_N \mid \xi) = \prod_{i=1}^{N} p(z(t_i) \mid Z_{i-1}, \xi) \tag{9.1-4}$$

For any particular $\xi$, the distribution of $z(t_i)$ given $Z_{i-1}$ is known from the Chapter 7 results; it is
Gaussian with mean

$$\tilde{z}_\xi(t_i) \equiv E\{z(t_i) \mid Z_{i-1}, \xi\}$$
$$= E\{C x(t_i) + D u(t_i) + G \eta_i \mid Z_{i-1}, \xi\}$$
$$= C \tilde{x}_\xi(t_i) + D u(t_i) \tag{9.1-5}$$

and covariance

$$R_i \equiv \mathrm{cov}\{z(t_i) \mid Z_{i-1}, \xi\}$$
$$= \mathrm{cov}\{C x(t_i) + D u(t_i) + G \eta_i \mid Z_{i-1}, \xi\}$$
$$= C Q(t_i) C^* + GG^* \tag{9.1-6}$$

Note that $\tilde{x}_\xi(t_i)$ and $\tilde{z}_\xi(t_i)$ are functions of $\xi$ because they are obtained from the Kalman filter based on a
particular value of $\xi$; that is, they are conditioned on $\xi$. We use the $\xi$ subscript notation to emphasize
this dependence. $R_i$ is also a function of $\xi$, although our notation does not explicitly indicate this.

Substituting the appropriate Gaussian density functions characterized by Equations (9.1-5) and (9.1-6)
into Equation (9.1-4) gives

$$L(\xi, Z_N) = p(Z_N \mid \xi) = \prod_{i=1}^{N} |2\pi R_i|^{-1/2} \exp\{ -\tfrac{1}{2} [z(t_i) - \tilde{z}_\xi(t_i)]^* R_i^{-1} [z(t_i) - \tilde{z}_\xi(t_i)] \} \tag{9.1-7}$$

This is the desired expression for the likelihood functional.
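
The factored likelihood can be evaluated by running a Kalman filter over the data and accumulating one term
per time point. The sketch below computes the negative logarithm of the likelihood for the pure discrete-time
model; the function name, the argument layout, and the particular arrangement of the filter equations (predict
to the next time point, form the innovation, then update) are assumptions for illustration rather than a
transcription of the Chapter 7 equations.

```python
import numpy as np

def neg_log_likelihood(Phi, Psi, C, D, F, G, x0, P0, u, z):
    """Negative log of the factored likelihood for a linear discrete-time model.

    u holds u(t_0)..u(t_N) and z holds z(t_1)..z(t_N), one time point per row.
    """
    FF, GG = F @ F.T, G @ G.T
    x = np.asarray(x0, dtype=float)
    P = np.array(P0, dtype=float)
    nll = 0.0
    for i in range(len(z)):
        # time update: predict state and covariance at the next time point
        x = Phi @ x + Psi @ u[i]
        P = Phi @ P @ Phi.T + FF
        # innovation and its covariance
        nu = z[i] - (C @ x + D @ u[i + 1])
        R = C @ P @ C.T + GG
        nll += 0.5 * (nu @ np.linalg.solve(R, nu)
                      + np.linalg.slogdet(2.0 * np.pi * R)[1])
        # measurement update
        K = P @ C.T @ np.linalg.inv(R)
        x = x + K @ nu
        P = P - K @ C @ P
    return nll

# Made-up scalar example with no dynamics and no process noise:
# each z(t_i) is then an independent unit-variance Gaussian sample.
one, zero = np.array([[1.0]]), np.array([[0.0]])
z = np.array([[1.0], [2.0]])
nll = neg_log_likelihood(zero, zero, one, zero, zero, one,
                         np.zeros(1), np.zeros((1, 1)), np.zeros((3, 1)), z)
```

For short records, the result can be checked against a direct evaluation of the joint Gaussian density of all
the observations, which is the static form that the factoring avoids.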

9.1.3 Derivation Using the Innovation 

Another derivation involves the properties of the innovation. This derivation also applies either to
mixed continuous/discrete-time or to pure discrete-time systems.

We proved in Chapter 7 that the innovations are a sequence of independent, zero-mean Gaussian variables
with covariances $R_i$ given by Equation (7.2-33). This proof was done for the pure discrete-time case, but
extends directly to mixed continuous/discrete-time systems. The Chapter 7 results assumed that the system
matrices were known; thus the results are conditioned on $\xi$. The conditional probability density function of
the innovations is therefore

$$p(V_N \mid \xi) = \prod_{i=1}^{N} |2\pi R_i|^{-1/2} \exp( -\tfrac{1}{2} \nu_i^* R_i^{-1} \nu_i ) \tag{9.1-8}$$

We also showed in Chapter 7 that the innovations are an invertible linear function of the observations.
Furthermore, it is easy to show that the determinant of the Jacobian of the transformation equals 1. (The
Jacobian is triangular with 1's on the diagonal.) Thus by Equation (3.4-1), we can substitute

$$\nu_i = z(t_i) - \tilde{z}_\xi(t_i) \tag{9.1-9}$$

into Equation (9.1-8) to give

$$p(Z_N \mid \xi) = \prod_{i=1}^{N} |2\pi R_i|^{-1/2} \exp\{ -\tfrac{1}{2} [z(t_i) - \tilde{z}_\xi(t_i)]^* R_i^{-1} [z(t_i) - \tilde{z}_\xi(t_i)] \} \tag{9.1-10}$$

which is identical to Equation (9.1-7). We see that the derivation by Bayes factoring and the derivation
using the innovation give the same result.

9.1.4 Steady-State Form

For many applications, we can use the steady-state Kalman filter in the cost functional, resulting in
major computational savings. This usage requires, of course, that the steady-state filter exist. We discussed
the criteria for the existence of the steady-state filter in Chapter 7. The most important criterion is
obviously that the system be time-invariant. The rest of this section assumes that a steady-state form exists.
When a steady-state form exists, two approaches can be taken to justifying its use.

The first justification is that the steady-state form is a good approximation if the time interval is long
enough. The time-varying filter gain converges to the steady-state gain with time constants at least as fast
as those of the open-loop system, and sometimes significantly faster. Thus, if the maneuver analyzed is long
compared to the system time constants, the filter gain would converge to the steady-state gain in a small
portion of the maneuver time. We could verify this behavior by computing time-varying gains for representative
values of $\xi$. If the filter gain does converge quickly to the steady-state gain, then the steady-state filter
should give a good approximation to the cost functional.

The second possible justification for the use of the steady-state filter involves the choice of the
initial state covariance $P_0$. The time-varying filter requires $P_0$ to be specified. It is a common practice
to set $P_0$ to zero. This practice arises more from a lack of better ideas than from any real argument that
zero is a good value. It is seldom that we know the initial state exactly as implied by the zero covariance.
One circumstance which would justify the zero initial covariance would be the case where the initial condition
is included in the list of unknown parameters. In this case, the initial covariance is properly zero because
the filter is conditioned on the values of the unknown parameters. Any prior information about the initial
condition is then reflected in the prior distribution of $\xi$ instead of in $P_0$. Unless one has a specific need
for estimates of the initial condition, there are usually better approaches.

We suggest that the steady-state covariance is often a reasonable value for the initial covariance. In
this case, the time-varying and steady-state filters are identical; arguments about the speed of convergence
and the length of the data interval are not required. Since the time-varying form requires significantly more
computation than the steady-state form, the steady-state form is preferable except where it is clearly and
significantly inferior.

If the steady-state filter is used, Equation (9.1-7) becomes

$$L(\xi, Z_N) = \prod_{i=1}^{N} |2\pi R|^{-1/2} \exp\{ -\tfrac{1}{2} [z(t_i) - \tilde{z}_\xi(t_i)]^* R^{-1} [z(t_i) - \tilde{z}_\xi(t_i)] \} \tag{9.1-11}$$

where R is the steady-state covariance of the innovation. In general, R is a function of $\xi$. The $\tilde{z}_\xi(t_i)$
in Equation (9.1-11) comes from the steady-state filter, unlike the $\tilde{z}_\xi(t_i)$ in Equation (9.1-7). We use the
same notation for both quantities, distinguishing them by context. (The $\tilde{z}_\xi(t_i)$ from the steady-state filter
is always associated with the steady-state covariance R, whereas the $\tilde{z}_\xi(t_i)$ from the time-varying filter is
associated with the time-varying covariance $R_i$.)

9.1.5 Cost Function Discussion 


The maximum-likelihood estimate of $\xi$ is obtained by maximizing Equation (9.1-11) (or Equation (9.1-7)
if the steady-state form is inappropriate) with respect to $\xi$.

Because of the exponential in Equation (9.1-11), it is more convenient to work with the logarithm of the
likelihood functional, called the log likelihood functional for short. The log likelihood functional is
maximized by the same value of $\xi$ that maximizes the likelihood functional because the logarithm is a
monotonic increasing function. By convention, most optimization theory is written in terms of minimization instead
of maximization. We therefore define the negative of the log likelihood functional to be a cost functional
which is to be minimized. We also omit the $\ln(2\pi)$ term from the cost functional, because it does not affect
the minimization. The most convenient expression for the cost functional is then

$$J(\xi) = \frac{1}{2} \sum_{i=1}^{N} [z(t_i) - \tilde{z}_\xi(t_i)]^* R^{-1} [z(t_i) - \tilde{z}_\xi(t_i)] + \frac{1}{2} N \ln|R| \tag{9.1-12}$$

If R is known, then Equation (9.1-12) is in a least-squares form. This is sometimes called a
prediction-error form because the quantity being minimized is the square of the one-step-ahead prediction error
$z(t_i) - \tilde{z}_\xi(t_i)$. The term "filter error" is also used because the quantity minimized is obtained from the
Kalman filter.
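
A sketch of the steady-state cost functional for the pure discrete-time model is given below. The steady-state
covariance is obtained here by simply iterating the covariance recursion a fixed number of times, which is an
assumption made for brevity (it presumes that the steady-state filter exists and that the iteration has
converged); the function name and argument layout are likewise hypothetical.

```python
import numpy as np

def steady_state_cost(Phi, Psi, C, D, F, G, x0, u, z, n_iter=500):
    """Cost functional of Eq. (9.1-12) using a steady-state Kalman filter.

    u holds u(t_0)..u(t_N) and z holds z(t_1)..z(t_N), one time point per row.
    """
    FF, GG = F @ F.T, G @ G.T
    P = FF.copy()                         # predicted state covariance
    for _ in range(n_iter):               # iterate the covariance recursion
        R = C @ P @ C.T + GG
        K = P @ C.T @ np.linalg.inv(R)
        P = Phi @ (P - K @ C @ P) @ Phi.T + FF
    R = C @ P @ C.T + GG                  # steady-state innovation covariance
    K = P @ C.T @ np.linalg.inv(R)        # steady-state gain
    x = np.asarray(x0, dtype=float)
    J = 0.0
    for i in range(len(z)):
        x = Phi @ x + Psi @ u[i]          # one-step-ahead prediction
        nu = z[i] - (C @ x + D @ u[i + 1])
        J += 0.5 * nu @ np.linalg.solve(R, nu)
        x = x + K @ nu                    # measurement update with fixed gain
    return J + 0.5 * len(z) * np.linalg.slogdet(R)[1]

# Made-up scalar example with no process noise (F = 0): the gain is zero and
# the filter reduces to a plain propagation of the system equations.
Phi, Psi = np.array([[0.5]]), np.array([[1.0]])
C, D = np.array([[1.0]]), np.array([[0.0]])
F, G = np.array([[0.0]]), np.array([[2.0]])
z = np.array([[1.0], [2.0]])
J = steady_state_cost(Phi, Psi, C, D, F, G, np.zeros(1), np.ones((3, 1)), z)
```

Because the gain and the innovation covariance are computed once rather than at every time point, the inner
loop involves only matrix-vector operations.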

Note that this form of the likelihood functional involves the Kalman filter, not a smoother. There is
sometimes a temptation to replace the filter in this cost function by a smoother, assuming that this will give
improved results. The smoother gives better state estimates than the filter, but the problem considered in
this chapter is not state estimation. The state estimates are an incidental side-product of the algorithm for
estimating the parameter vector $\xi$. There are ways of deriving and writing the parameter estimation problem
which involve smoothers (Cox and Bryson, 1980), but the direct use of a smoother in Equation (9.1-12) is
simply incorrect.

For MAP estimates, we modify the cost functional by adding the negative of the logarithm of the prior
probability density of $\xi$. If the prior distribution of $\xi$ is Gaussian with mean $m_\xi$ and covariance $W$, the
cost functional of Equation (9.1-12) becomes (ignoring constant terms)

$$J(\xi) = \frac{1}{2} \sum_{i=1}^{N} [z(t_i) - \tilde{z}_\xi(t_i)]^* R^{-1} [z(t_i) - \tilde{z}_\xi(t_i)] + \frac{1}{2} N \ln|R| + \frac{1}{2} (\xi - m_\xi)^* W^{-1} (\xi - m_\xi) \tag{9.1-13}$$

The filter-error forms of Equations (9.1-12) and (9.1-13) are parallel to the output-error forms of
Equations (8.1-4) and (8.1-2). When there is no process noise, the steady-state Kalman filter becomes an
integration of the system equations, and the innovation covariance $R$ equals the measurement noise covariance
$GG^*$. Thus the output error equations of the previous chapter are special cases of the filter error equations
with zero process noise.


9.2 COMPUTATION 

The best methods for minimizing Equation (9.1-12) or (9.1-13) are based on the Gauss-Newton algorithm.
Because these equations are so similar in form to the output-error equations of Chapter 8, most of the
Chapter 8 material on computation applies directly or with only minor modification.

The primary differences between computational methods for filter error and those for output error center
on the treatment of the noise covariances, particularly when the covariances are unknown. Maine and Iliff
(1981a) discuss the implementation details of the filter-error algorithm. The Iliff-Maine code, MMLE3 (Maine
and Iliff, 1980; and Maine, 1981), implements the filter-error algorithm for linear continuous/discrete-time
systems.

We generally presume the use of the steady-state filter in the filter-error algorithm. Implementation is
significantly more complicated using the time-varying filter.


9.3 FORMULATION AS A FILTERING PROBLEM

An alternative to the direct approach of the previous section is to recast the parameter estimation
problem into the form of a filtering problem. The techniques of Chapter 7 then apply.

Suppose we start with the system model

$$x(t_0) = x_0 \tag{9.3-1a}$$

$$\dot{x}(t) = A(\xi)x(t) + B(\xi)u(t) + Fn(t) \tag{9.3-1b}$$

$$z(t_i) = C(\xi)x(t_i) + D(\xi)u(t_i) + G\eta_i \tag{9.3-1c}$$

This is the same as Equation (9.0-1), except that here we explicitly indicate the dependence of the matrices
on $\xi$. The problem is to estimate $\xi$.

In order to apply state estimation techniques to this problem, $\xi$ must be part of the state vector.
Therefore, we define an augmented state vector

$$x_a = \begin{bmatrix} x \\ \xi \end{bmatrix} \tag{9.3-2}$$

We can combine Equation (9.3-1) with the trivial differential equation

$$\dot{\xi} = 0 \tag{9.3-3}$$

to write a system equation with $x_a$ as the state vector. Note that the resulting system is nonlinear in $x_a$
(because it has products of $\xi$ and $x$), even though Equation (9.3-1) is linear in $x$.
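
The nonlinearity of the augmented system can be seen in a small Python sketch. It assumes, purely for
illustration, a scalar state equation $\dot{x} = ax + bu$ with unknown parameters $\xi = (a, b)$, so that the
augmented state is $x_a = (x, a, b)$. The function names are hypothetical.

```python
import numpy as np

def augmented_derivative(xa, u):
    """Right-hand side of the augmented system for the assumed scalar model:
    x_dot = a*x + b*u, and xi_dot = 0 for the parameters a and b."""
    x, a, b = xa
    return np.array([a * x + b * u, 0.0, 0.0])

def augmented_jacobian(xa, u):
    """Partial derivatives of the right-hand side with respect to x_a.
    The first row contains x and u themselves, so the Jacobian changes with
    the augmented state: the system is nonlinear in x_a."""
    x, a, b = xa
    return np.array([[a,   x,   u],
                     [0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0]])
```

For a system that is linear in its state vector, the Jacobian would be a constant matrix. Here it depends on
the current value of $x$, which is the product-of-$\xi$-and-$x$ effect described above.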

In principle, we can apply the extended Kalman filter, discussed in Section 7.7, to the problem of
estimating $x_a$. Unfortunately, the nonlinearity in the augmented system is crucial to the system behavior. The
adequacy of the extended Kalman filter for this problem has seldom been analyzed in detail. Schweppe (1973,
p. 433) says on this subject

...the system identification problem has been transformed into a problem
which has already been discussed extensively.

The discussions are not terminated at this point for the simple reason that
Part IV did not provide any "best" one way to solve a nonlinear state
estimation problem. A major conclusion of Part IV was that the best way to
proceed depends heavily on the explicit nature of the problem. System
identification leads to special types of nonlinear estimation problems, so
specialized discussions are needed.

...the state augmentation approach is not emphasized, as the author feels
that it is much more appropriate to approach the system identification
problem directly. However, there are special cases where state augmentation
works very well.

CHAPTER 10


10.0 EQUATION ERROR METHOD FOR DYNAMIC SYSTEMS 

This chapter discusses the equation error approach to parameter estimation for dynamic systems. We will
first define a restricted form of equation error, parallel to the treatments of output error and filter error
in the previous chapters. This form of equation error is a special case of filter error where there is process
noise, but no measurement noise. It therefore stands in counterpoint to output error, which is the special
case where there is measurement noise, but no process noise.

We will then extend the definition of equation error to a more general form. Some of the practical
applications of equation error do not fit precisely into the overly restrictive form based on process noise only.
In its most general forms, the term equation error encompasses output error and filter error, in addition to
the forms most commonly associated with the term. The primary distinguishing feature of the methods emphasized
in this chapter is their computational simplicity.


10.1 PROCESS-NOISE APPROACH 

In this section, we consider equation error in a manner parallel to the previous treatments of output
error and filter error. The filter-error method treats systems with both process noise and measurement noise,
and output error treats the special case of systems with measurement noise only. Equation error completes this
triad of algorithms by treating the special case of systems with process noise only.

The equation-error method applies to nonlinear systems with additive Gaussian process noise. We will
restrict the discussion of this section to pure discrete-time models, for which the derivation is
straightforward. Mixed continuous/discrete-time models can be handled by converting them to equivalent pure
discrete-time models. Equation error does not strictly apply to pure continuous-time models. (The problem becomes
ill-posed.)

The general form of the nonlinear, discrete-time system model we will consider is

$$x(t_0) = x_0 \tag{10.1-1a}$$

$$x(t_{i+1}) = f[x(t_i),u(t_i),\xi] + Fn_i \qquad i = 0,1,\ldots,N-1 \tag{10.1-1b}$$

$$z(t_i) = g[x(t_i),u(t_i),\xi] \qquad i = 0,1,\ldots,N \tag{10.1-1c}$$

The process noise, $n$, is a sequence of independent Gaussian random variables with zero mean and identity
covariance. The matrix $F$ can be a function of $\xi$, although the simplified notation ignores this possibility.
It will prove convenient to assume that the measurements $z(t_i)$ are defined for $i = 0,\ldots,N$; previous
chapters have defined them only for $i = 1,\ldots,N$.

10.1.1 Derivation 

The following derivation of the equation-error method closely parallels the derivation of the filter-error
method in Section 9.1.3. Both are based primarily on application of the transformation of variables formula,
Equation (3.4-1), starting from a process known to be a sequence of independent Gaussian variables.

By assumption, the probability density function of the process noise is

$$p(n_N) = \prod_{i=0}^{N-1} (2\pi)^{-\ell/2} \exp\left(-\tfrac{1}{2} n_i^* n_i\right) \tag{10.1-2}$$

where $n_N$ is the concatenation of the $n_i$ and $\ell$ is the length of each $n_i$. We further assume that $F$ is
invertible for all permissible values of $\xi$; this assumption is necessary to ensure that the problem is
well-posed. We define $X_N$ to be the concatenation of the $x(t_i)$. Then, for each value of $\xi$, $X_N$ is an
invertible linear function of $n_N$. The inverse function is

$$n_i = F^{-1}[x(t_{i+1}) - \tilde{x}_\xi(t_{i+1})] \tag{10.1-3}$$

where, for convenience and for consistency with the notation of previous chapters, we have defined

$$\tilde{x}_\xi(t_{i+1}) = f[x(t_i),u(t_i),\xi] \tag{10.1-4}$$

The determinant of the Jacobian of the inverse transformation is $|F|^{-N}$ because the inverse transformation
matrix is block-triangular with $F^{-1}$ in the diagonal blocks. Direct application of the
transformation-of-variables formula, Equation (3.4-1), gives

$$p(X_N|\xi) = \prod_{i=1}^{N} |2\pi FF^*|^{-1/2} \exp\left\{-\tfrac{1}{2} [x(t_i) - \tilde{x}_\xi(t_i)]^* (FF^*)^{-1} [x(t_i) - \tilde{x}_\xi(t_i)]\right\} \tag{10.1-5}$$

In order to derive a simple expression for $p(Z_N|\xi)$, we require that $g$ be a continuous, invertible
function of $x$ for each value of $\xi$. The invertibility is critical to the simplicity of the equation-error
algorithm. This assumption, combined with the lack of measurement noise, means that we can reconstruct the
state vector perfectly, provided that we know $\xi$. The inverse function gives this reconstruction:

$$\hat{x}_\xi(t_i) = g^{-1}[z(t_i),u(t_i),\xi] \tag{10.1-6}$$

If $g$ is not invertible, a recursive state estimator becomes embedded in the algorithm and we are again faced
with something as complicated as the filter-error algorithm. For invertible $g$, the
transformation-of-variables formula, Equation (3.4-1), gives


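$$p(Z_N|\xi) = \prod_{i=1}^{N} \left|\frac{\partial g^{-1}[z(t_i),u(t_i),\xi]}{\partial z}\right| \, |2\pi FF^*|^{-1/2} \exp\left\{-\tfrac{1}{2} [\hat{x}_\xi(t_i) - \tilde{x}_\xi(t_i)]^* (FF^*)^{-1} [\hat{x}_\xi(t_i) - \tilde{x}_\xi(t_i)]\right\} \tag{10.1-7}$$

where $\hat{x}_\xi(t_i)$ is given by Equation (10.1-6), and

$$\tilde{x}_\xi(t_i) = f[\hat{x}_\xi(t_{i-1}),u(t_{i-1}),\xi] \tag{10.1-8}$$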


Most practical applications of equation error separate the problems of state reconstruction and parameter
estimation. In the context defined above, this is possible when $g$ is not a function of $\xi$. Then
Equation (10.1-6) is also independent of $\xi$; thus, we can reconstruct the state exactly without knowledge of $\xi$.
Furthermore, the estimates of $\xi$ depend only on the reconstructed state vector and the control vector. There
is no direct dependence on the actual measurements $z(t_i)$ or on the exact form of the $g$-function. This is
evident in Equation (10.1-7) because the Jacobian of $g^{-1}$ is independent of $\xi$ and, therefore, irrelevant to
the parameter-estimation problem. In many practical applications, the state reconstruction is more complicated
than a simple pointwise function as in Equation (10.1-6), but as long as the state reconstruction does not
depend on $\xi$, the details do not matter to the parameter-estimation process.

You will seldom (if ever) see Equation (10.1-7) elsewhere in the form shown here, which includes the
factor for the Jacobian of $g^{-1}$. The usual derivation ignores the measurement equation and starts from the
assumption that the state is known exactly, whether by direct measurement or by some reconstruction. We have
included the measurement equation only in order to emphasize the parallels between equation error, output
error, and filter error. For the rest of this section, we will assume that $g$ is independent of $\xi$. We will
specifically assume that the determinant of the Jacobian of $g$ is 1 (the actual value being irrelevant to the
estimator anyway), so that we can write Equation (10.1-7) in a more conventional form as


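$$p(Z_N|\xi) = \prod_{i=1}^{N} |2\pi FF^*|^{-1/2} \exp\left\{-\tfrac{1}{2} [x(t_i) - \tilde{x}_\xi(t_i)]^* (FF^*)^{-1} [x(t_i) - \tilde{x}_\xi(t_i)]\right\} \tag{10.1-9}$$

where

$$\tilde{x}_\xi(t_i) = f[x(t_{i-1}),u(t_{i-1}),\xi] \tag{10.1-10}$$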

You can derive slight generalizations, useful in some cases, from Equation (10.1-7). 

The maximum-likelihood estimate of $\xi$ is the value that maximizes Equation (10.1-9). As in previous
chapters, it is convenient to work in terms of minimizing the negative-log-likelihood functional

$$J(\xi) = \frac{1}{2} \sum_{i=1}^{N} [x(t_i) - \tilde{x}_\xi(t_i)]^* (FF^*)^{-1} [x(t_i) - \tilde{x}_\xi(t_i)] + \frac{1}{2} N \ln|FF^*| \tag{10.1-11}$$

If $\xi$ has a Gaussian prior distribution with mean $m_\xi$ and covariance $P$, then the MAP estimate minimizes

$$J(\xi) = \frac{1}{2} \sum_{i=1}^{N} [x(t_i) - \tilde{x}_\xi(t_i)]^* (FF^*)^{-1} [x(t_i) - \tilde{x}_\xi(t_i)] + \frac{1}{2} N \ln|FF^*| + \frac{1}{2} (\xi - m_\xi)^* P^{-1} (\xi - m_\xi) \tag{10.1-12}$$
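
The cost functions of Equations (10.1-11) and (10.1-12) translate almost line for line into code. The following
Python sketch is one possible arrangement, not a reference implementation; the function name
`equation_error_cost`, the array layout, and the calling convention for the user-supplied function `f` are all
assumptions made for illustration.

```python
import numpy as np

def equation_error_cost(x, u, f, xi, F, m_xi=None, P=None):
    """Sketch of Eq. (10.1-11); the prior term of Eq. (10.1-12) is added
    when both m_xi and P are supplied.
    x holds x(t_0)..x(t_N) as rows; u holds the matching controls as rows;
    f(x_prev, u_prev, xi) returns the one-step prediction of Eq. (10.1-10)."""
    FFt = F @ F.T
    FFt_inv = np.linalg.inv(FFt)
    N = x.shape[0] - 1
    cost = 0.0
    for i in range(1, N + 1):
        resid = x[i] - f(x[i - 1], u[i - 1], xi)   # state-equation error at t_i
        cost += 0.5 * resid @ FFt_inv @ resid
    cost += 0.5 * N * np.log(np.linalg.det(FFt))   # (1/2) N ln|FF*|
    if m_xi is not None and P is not None:
        d = np.asarray(xi, dtype=float) - np.asarray(m_xi, dtype=float)
        cost += 0.5 * d @ np.linalg.inv(P) @ d     # Gaussian prior term
    return cost
```

Note that no state propagation appears in the loop: each residual uses only the reconstructed states at two
adjacent time points.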

10.1.2 Special Case of Filter Error 

For linear systems, we can also derive state-equation error by plugging into the linear filter-error
algorithm derived in Chapter 9. Assume that $G$ is 0; $FF^*$ is invertible; $C$ is square, invertible, and known
exactly; and $D$ is known exactly. These are the assumptions that mean we have perfect measurements of the
state of the system.

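The Kalman filter for this case is (repeating Equation (7.3-11))

$$\hat{x}(t_i) = C^{-1}[z(t_i) - Du(t_i)] \tag{10.1-13}$$

and the covariance, $P_i$, of this filtered estimate is 0. The one-step-ahead prediction is

$$\tilde{x}_\xi(t_{i+1}) = \Phi\hat{x}(t_i) + \Psi u(t_i) \tag{10.1-14}$$

From Equation (9.1-6) we have

$$Q_i = FF^* \tag{10.1-15}$$

$$R_i = CFF^*C^* \tag{10.1-16}$$

and thus Equation (9.1-12) becomes

$$J(\xi) = \frac{1}{2} \sum_{i=1}^{N} [\hat{x}(t_i) - \tilde{x}_\xi(t_i)]^* C^* (CFF^*C^*)^{-1} C [\hat{x}(t_i) - \tilde{x}_\xi(t_i)] + \frac{1}{2} N \ln|CFF^*C^*| \tag{10.1-17}$$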

Eliminating irrelevant $|C|$ constants, we can redefine the cost function as

$$J(\xi) = \frac{1}{2} \sum_{i=1}^{N} [\hat{x}(t_i) - \tilde{x}_\xi(t_i)]^* (FF^*)^{-1} [\hat{x}(t_i) - \tilde{x}_\xi(t_i)] + \frac{1}{2} N \ln|FF^*| \tag{10.1-18}$$

which is in the form of Equation (10.1-11). Note that $C$ and $D$ play no role in this estimator, outside of the
reconstruction of the state using Equation (10.1-13).
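
The claim that $C$ contributes only an irrelevant constant can be checked numerically. The Python sketch below
evaluates the cost with the innovation covariance $CFF^*C^*$ and the cost written in terms of $FF^*$ alone for
the same set of state residuals. The function names and the arrangement of the residuals as rows of an array
are assumptions for illustration.

```python
import numpy as np

def cost_measurement_form(dx, C, FFt):
    """Cost using innovations C @ dx[i] weighted by R = C FF* C*, as in Eq. (10.1-17)."""
    R = C @ FFt @ C.T
    dz = dx @ C.T                                   # each row is C times a state residual
    quad = 0.5 * np.einsum('ij,jk,ik->', dz, np.linalg.inv(R), dz)
    return quad + 0.5 * len(dx) * np.log(np.linalg.det(R))

def cost_state_form(dx, FFt):
    """Cost written directly in the state residuals, as in Eq. (10.1-18)."""
    quad = 0.5 * np.einsum('ij,jk,ik->', dx, np.linalg.inv(FFt), dx)
    return quad + 0.5 * len(dx) * np.log(np.linalg.det(FFt))
```

For any invertible $C$, the two values differ by $N \ln|\det C|$. This offset does not involve the residuals,
so it does not depend on $\xi$ and cannot move the minimum.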

10.1.3 Discussion 

The cost function defined by Equation (10.1-11) or (10.1-12) involves a weighted square sum of the error
that would be in the state equation, Equation (10.1-1b), if the noise term were omitted. The term "equation
error" derives from this fact. This terminology is rather vague, giving little hint as to what equation is
meant. The output-error and filter-error methods described in previous chapters could, with equal validity, be
categorized as methods involving minimizing the error of some equation. In spite of this potential ambiguity,
the use of the term "equation error" is well-established, and the term is unlikely to be misinterpreted. The
terms "state-equation error" and "observation-equation error," which we use in the following sections, are more
definitive, but not widely used.

The equation-error method is also referred to by several other names. The term "least squares" is
sometimes used to define the method, but this terminology is subject to misinterpretation. The large majority of
the estimation methods used can be classified as least-squares methods. We suggest using the term "least
squares" only to refer to this broad class of methods (as in the statement "equation error is a least squares
method"), never to precisely specify a method. The term "linear least squares" is somewhat more definitive (at
least for the case in which $f$ is a linear function of $\xi$) and has been used on occasion. Another term often
used is "regression" method (or, more definitively, "linear regression").

The terms "equation error" or "least squares" are often used to contrast this method with
maximum-likelihood estimators. Such contrasts are inappropriate and misleading because equation error is a completely
rigorous maximum-likelihood estimator for the problem as stated. The differences between equation error,
output error, and filter error lie in the problem statements and assumptions, not in the statistical principles
used nor in the rigor of the derivation. To disparage equation error on the basis that it is not maximum
likelihood because it ignores measurement noise smacks more of snobbery than of honest evaluation. The neglect
of measurement noise may, indeed, be a significant flaw for some applications, but this flaw is irrelevant to
the issue of whether equation error is maximum likelihood.

A related common misconception is that equation-error estimates are biased, whereas output-error or
filter-error estimates are asymptotically unbiased. To the contrary, equation error is asymptotically unbiased
for the problem as stated; in many applications, the equation-error estimates are even unbiased for finite
time. It is true that equation error is biased in the presence of measurement noise, but output error is
likewise biased in the presence of process noise.
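
A small simulation makes the first half of this statement concrete. The Python sketch below assumes a scalar
model $x(t_{i+1}) = a\,x(t_i) + n_i$ with unit-variance process noise, and applies the same equation-error
estimate of $a$ first to the exact states and then to states corrupted by unit-variance measurement noise. The
model, the noise levels, the sample size, and the function name are all choices made for this illustration.

```python
import numpy as np

def state_equation_error_estimate(x):
    """Least-squares estimate of a in x[i+1] = a*x[i] + noise
    (the scalar, linear case of minimizing Eq. (10.1-11))."""
    return np.dot(x[1:], x[:-1]) / np.dot(x[:-1], x[:-1])

rng = np.random.default_rng(0)
a_true, N = 0.8, 20000
x = np.zeros(N + 1)
for i in range(N):
    x[i + 1] = a_true * x[i] + rng.standard_normal()   # process noise only

# Assumptions of Section 10.1 satisfied: perfect state measurements.
a_clean = state_equation_error_estimate(x)

# Assumptions violated: measurement noise added to every state sample.
a_noisy = state_equation_error_estimate(x + rng.standard_normal(N + 1))
```

With perfect state measurements the estimate lies close to the true value, whereas the added measurement noise
pulls the estimate toward zero no matter how many samples are used.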

The principle illustrated here is universal: any estimator is biased (among other problems) when applied
to systems that violate the assumptions used in deriving the estimator. This principle applies to all
assumptions, not just to the presence or absence of noise. Because any real system will violate any tractable set of
assumptions, all estimators are actually biased. (All of our previous statements that given estimators are
unbiased are based on idealized systems meeting the stated assumptions.)

The unqualified statement that a given estimator is biased is, therefore, of little use in evaluating the
estimator. More pertinent issues include the questions of which assumptions are most severely violated by the
actual system, and how sensitive the estimator is to these violations. The magnitude of the bias is a
reasonable means of addressing these questions, but the mere existence of a bias is not.


10.2 GENERAL EQUATION ERROR FORM 

Many practical applications of equation error do not fit naturally into the restrictive definition of the
previous section, which allows no measurement noise. There are several alternate definitions of equation error
that accommodate these applications. These alternate definitions involve apparently disparate statistical
assumptions. The unifying theme, which justifies the use of the same terminology and computational tools for
these various cases, is the form of the resulting cost function. In some cases, two different viewpoints and
corresponding different assumptions about the same application can result in identical computations.

We will, therefore, take the cost-function form as the general defining property of equation-error
estimators. This form can arise from several different sets of statistical assumptions. This nonstatistical,
result-oriented approach to the definition helps us to avoid unnaturally contorting some problem statements
to force them to fit an overly rigid definition, when a more natural problem statement achieves the same
result.

To define a general equation-error estimator, we start with some equation, expressed as a function of the
measurements and the unknown parameters, which should ideally (ignoring noise and modeling errors) be satisfied
at every measurement time point. We write the equation in the general form

$$h[z(\cdot),u(\cdot),t_i,\xi] = 0 \qquad i = 1,2,\ldots,N \tag{10.2-1}$$

Sections 10.2.1 through 10.2.3 give specific common cases of such equations.

The equation-error estimate based on this equation is then the value of $\xi$ that minimizes the cost
function

$$J(\xi) = \frac{1}{2} \sum_{i=1}^{N} h[z(\cdot),u(\cdot),t_i,\xi]^* \, W \, h[z(\cdot),u(\cdot),t_i,\xi] \tag{10.2-2}$$

where $W$ is a positive semidefinite weighting matrix. The definition assumes that the minimum exists and is
unique.

In order to accommodate prior information and unknown $W$ matrices, we allow the form of Equation (10.2-2)
to be extended to

$$J(\xi) = \frac{1}{2} \sum_{i=1}^{N} h[z(\cdot),u(\cdot),t_i,\xi]^* \, W \, h[z(\cdot),u(\cdot),t_i,\xi] + \frac{1}{2} N \ln|W^{-1}| + \frac{1}{2} (\xi - m_\xi)^* P^{-1} (\xi - m_\xi) \tag{10.2-3}$$

corresponding to Equation (10.1-12). The above definition is broad enough to include output error and filter
error, as well as the equation-error estimators defined in Section 10.1.

The estimators emphasized in this chapter have the particular property that the $h$ dependence on $z(\cdot)$
and $u(\cdot)$ is restricted to one or two time points. The central statistical assumption that gives this property
is that there are perfect (no noise) measurements of the state. This assumption reduces the Kalman filter to
the form of Equation (10.1-3), which eliminates the integration of the state equation. With this assumption,
Equation (10.1-3) is the obvious optimal filter even for nonlinear state equations. We are also forced to
assume that the process noise covariance $FF^*$ is nonsingular; a singular $FF^*$ combined with the perfect
state measurements would give an ill-posed problem.

10.2.1 Discrete State-Equation Error

One specific case of the equation-error method is state-equation error. In this case, the specific form
of Equation (10.2-1) derives from the state equation, ignoring the process noise. We will first consider
state-equation error for discrete-time systems. The discrete-time state equation for a general nonlinear
system, ignoring the process noise, is

$$x(t_{i+1}) = f[x(t_i),u(t_i),\xi] \qquad i = 0,1,\ldots,N-1 \tag{10.2-4}$$

The $h$ function based on this equation is

$$h[z(\cdot),u(\cdot),t_i,\xi] = x(t_i) - f[x(t_{i-1}),u(t_{i-1}),\xi] \qquad i = 1,2,\ldots,N \tag{10.2-5}$$

This form presumes that the $x(t_i)$ can be reconstructed as a function of the $z(t_i)$ and $u(t_i)$.

We recognize discrete-time state-equation error as the method derived in Section 10.1. Equation (10.1-12)
(with Equation (10.1-10)) is a special case of Equation (10.2-3) using Equation (10.2-5) for $h$ and $(FF^*)^{-1}$ for
$W$. Section 10.1 discussed the details of the statistical assumptions implicit in this form.

Note also that we can define a state-equation error method whether or not the state measurements are
noise-free. The only requisite for a plausible state-equation error method is that we have some estimate of
the state to use in Equation (10.2-5). If the measurements are contaminated with noise, then the estimator is
not a maximum-likelihood estimator and will be asymptotically biased. There are many practical circumstances,
however, where a simple equation-error estimator is preferable to the "optimal" alternatives.

10.2.2 Continuous/Discrete State-Equation Error 

For a mixed continuous/discrete-time system with additive process noise, the state equation is

$$\dot{x}(t) = f[x(t),u(t),\xi] + Fn(t) \tag{10.2-6}$$

The $h$ function for a continuous/discrete-time state-equation error method derives from evaluating the state
equation at the measurement times $t_i$ and ignoring the process noise:

$$h[z(\cdot),u(\cdot),t_i,\xi] = \dot{x}(t_i) - f[x(t_i),u(t_i),\xi] \tag{10.2-7}$$

The use of this form in an equation-error method presumes that the state $x(t_i)$ can be reconstructed as a
function of the $z(t_i)$ and $u(t_i)$. This presumption is identical to that for discrete-time state-equation
error, and it implies the same conditions: there must be noise-free measurements of the state, independent of
$\xi$. It is implicit that a known invertible transformation of such measurements is statistically equivalent. As
in the discrete-time case, we can define the estimator even when the measurements are noisy, but it will no
longer be a maximum-likelihood estimator.

Equation (10.2-7) also presumes that the derivative $\dot{x}(t_i)$ can be reconstructed from the measurements.
Neglecting for the moment the statistical implications, note that we can form a plausible equation-error
estimator using any reasonable means of approximating a value for $\dot{x}(t_i)$ independently of $\xi$. The simplest case of
this is when the observation vector includes measurements of the state derivatives in addition to the
measurements of the states. If such derivative measurements are not directly available, we can always approximate
$\dot{x}(t_i)$ by finite-difference differentiation of the state measurements, as in

$$\dot{x}(t_i) \approx \frac{x(t_{i+1}) - x(t_{i-1})}{t_{i+1} - t_{i-1}} \tag{10.2-8}$$

Both direct measurement and finite-difference approximation are used in practice.

Rigorous statistical treatment is easiest for the case of finite-difference approximations. To arrive at
such a form, we write the state equation in integrated form as

$$x(t_{i+1}) = x(t_i) + \int_{t_i}^{t_{i+1}} f[x(t),u(t),\xi]\,dt + \int_{t_i}^{t_{i+1}} Fn(t)\,dt \tag{10.2-9}$$

An approximate solution (not necessarily the best approximation) to Equation (10.2-9) is

$$x(t_{i+1}) \approx x(t_i) + (t_{i+1} - t_i) f[x(t_i),u(t_i),\xi] + F_d n_i \tag{10.2-10}$$

where $n_i$ is a sequence of independent Gaussian variables, and $F_d$ is the equivalent discrete $F$-matrix.
Sections 6.2 and 7.5 discuss such approximations.

Equation (10.2-10) is in the form of a discrete-time state equation. The discrete-time state-equation
error method based on this equation uses

$$h[z(\cdot),u(\cdot),t_i,\xi] = x(t_i) - x(t_{i-1}) - (t_i - t_{i-1}) f[x(t_{i-1}),u(t_{i-1}),\xi] \tag{10.2-11}$$

Redefining $h$ by dividing by $t_i - t_{i-1}$ gives the form

$$h[z(\cdot),u(\cdot),t_i,\xi] = \dot{x}(t_i) - f[x(t_{i-1}),u(t_{i-1}),\xi] \tag{10.2-12}$$

where the derivative is obtained from the finite-difference formula

$$\dot{x}(t_i) = \frac{x(t_i) - x(t_{i-1})}{t_i - t_{i-1}} \tag{10.2-13}$$


Other discrete-time approximations of Equation (10.2-9) result in different finite-difference formulae.
The central-difference form of Equation (10.2-8) is usually better than the one-sided form of
Equation (10.2-13), although Equation (10.2-8) has a lower bandwidth. If the bandwidth of Equation (10.2-8)
presents problems, a better approach than Equation (10.2-13) is to use

$$h[z(\cdot),u(\cdot),t_i,\xi] = \dot{x}(t_{i-1/2}) - f[x(t_{i-1/2}),u(t_{i-1/2}),\xi] \tag{10.2-14}$$

where we have used the notation

$$t_{i-1/2} = \tfrac{1}{2}(t_i + t_{i-1}) \tag{10.2-15}$$

and

$$\dot{x}(t_{i-1/2}) = \frac{x(t_i) - x(t_{i-1})}{t_i - t_{i-1}} \tag{10.2-16}$$

There are several other reasonable finite-difference formulae applicable to this problem.
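
The finite-difference formulae above are short enough to state directly in code. The Python sketch below gives
one possible arrangement; the function names and the convention that states are stored as rows of an array are
assumptions for illustration.

```python
import numpy as np

def central_difference(x, t):
    """Central-difference derivative of Eq. (10.2-8), defined at the interior times t[1:-1]."""
    return (x[2:] - x[:-2]) / (t[2:] - t[:-2])[:, None]

def one_sided_difference(x, t):
    """Difference quotient [x(t_i) - x(t_{i-1})] / (t_i - t_{i-1}).
    Eq. (10.2-13) assigns it to t_i; Eq. (10.2-16) assigns it to the midpoint."""
    return (x[1:] - x[:-1]) / (t[1:] - t[:-1])[:, None]

def midpoint_times(t):
    """Times halfway between adjacent samples."""
    return 0.5 * (t[1:] + t[:-1])
```

For a quadratic state history on a uniform time grid, the central difference reproduces the derivative exactly
at the interior sample times, while the one-sided quotient reproduces it exactly at the midpoint times rather
than at the sample times. This is one way to see why assigning the quotient to the midpoint is more accurate
than assigning it to either end of the interval.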

Rigorous statistical treatment of the case in which direct state derivative measurements are available
raises several complications. Furthermore, it is difficult to get a rigorous result in the form typically
used, namely an equation-error method based on $\dot{x}$ measurements substituted into Equation (10.2-7). It is probably
best to regard this approach as an equation-error estimator derived from plausible, but ad hoc, reasoning.

We will briefly outline the statistical issues raised by state derivative measurements, without attempting
a complete analysis. The first problem is that, for systems with white process noise, the state derivative is
infinite at every point in time. (Careful argument is required even to define the derivative.) We could avoid
this problem by requiring the process noise to be band-limited, or by other means, but the resulting estimator
will not be in the desired form. A heuristic explanation is that the $x$ measurements contain implicit
information about the derivative (from the finite differences), and simple use of the measured derivative
ignores this information. A rigorous maximum-likelihood estimator would use both sources of information.

This statement assumes that the $\dot{x}$ measurements and the finite-difference derivatives are independent data.
It is conceivable that the $x$ "measurements" are obtained as sums of the $\dot{x}$ measurements (for instance, in
an inertial navigation unit). Such cases are merely integrated versions of the finite-difference approach, not
really comparable to cases of independent $\dot{x}$ measurements.

The lack of a rigorous derivation for the state-equation error method with independently measured state 
derivatives does not necessarily mean that it is a poor estimator. If the information in the state derivative 
measurements is much better than the information in the finite-difference state derivatives, we can justify 
the approach as a good approximation. Furthermore, as expressed in our discussions in Section 1.4, an esti- 
mator does not have to be statistically derived to be a good estimator. For some problems, this estimator 
gives adequate results with low computational costs; when this result occurs, it is sufficient justification 
in itself. 

10.2.3 Observation-Equation Error


Another specific case of the equation-error method is observation-equation error. In this case, the
specific form of $h$ comes from the observation equation, ignoring the noise. The equation is the same for
pure discrete-time or mixed continuous/discrete-time systems. The observation equation for a system with
additive noise is

$$z(t_i) = g[x(t_i),u(t_i),\xi] + G\eta_i \tag{10.2-17}$$

The $h$ function based on this equation is

$$h[z(\cdot),u(\cdot),t_i,\xi] = z(t_i) - g[x(t_i),u(t_i),\xi] \tag{10.2-18}$$

As in the case of state-equation error, observation-equation error requires measurements or
reconstructions of the state, because $x(t_i)$ appears in the equation. The comments in Section 10.2.1 about noise in the
state measurement apply here also. Observation-equation error does not require measurements of the state
derivative.

The observation-equation error method also requires that there be some measurements in addition to the
states, or the method reduces to triviality. If the states were the only measurements, the observation
equation would reduce to

$$z(t_i) = x(t_i) \tag{10.2-19}$$

which has no unknown parameters. There would, therefore, be nothing to estimate.

The observation-equation error method applies only to estimating parameters in the observation equation.
Unknown parameters in the state equation do not enter this formulation. In fact, the existence of the state
equation is largely irrelevant to the method.

This irrelevance perhaps explains why observation-equation error is usually neglected in discussions of 
estimators for dynamic systems. The method is essentially a direct application of the static estimators of 
Chapter 5, taking no advantage of the dynamics of the system (the state equation). From a theoretical view- 
point, it may seem out of place in this chapter. 

In practice, the observation-equation-error method is widely used, sometimes contorted to look like a 
state-equation-error method. The observation-equation-error method is often a competitor to an output-error 
method. Our treatment of observation-equation error is intended to facilitate a fair evaluation of such 
choices and to avoid unnecessary contortions into state-equation error forms. 
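
As an indication of how little machinery the observation-equation-error method needs, the Python sketch below
estimates two parameters of a scalar observation equation assumed, for illustration only, to have the form
$z(t_i) = \xi_1 x(t_i) + \xi_2 u(t_i)$. The function name and the model form are hypothetical; the fit is an
ordinary static least-squares problem, and the state equation never appears.

```python
import numpy as np

def observation_equation_error_estimate(z, x, u):
    """Least-squares fit of the assumed observation equation
    z(t_i) = xi[0]*x(t_i) + xi[1]*u(t_i), a scalar case of Eq. (10.2-18)."""
    regressors = np.column_stack([x, u])            # one row per measurement time
    xi_hat, *_ = np.linalg.lstsq(regressors, z, rcond=None)
    return xi_hat
```

The state values passed in are measured or reconstructed states, as the method requires; any noise in them
affects the estimate as discussed in Section 10.2.1.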


10.3 COMPUTATION 

We have previously mentioned that a unifying characteristic of the methods discussed in this chapter is 
their computational simplicity. We have not, however, given much detail on the computational issues. 

Equation (10.2-3), which encompasses all equation-error forms, is in the form of Equation (2.5-1) if the
weighting matrix $W$ is known. Therefore, the Gauss-Newton optimization algorithm applies directly. Unknown
$W$ matrices can be handled by the method discussed in Sections 5.5 and 8.4.

In the most general definition of equation error, this is nearly the limit of what we can state about
computation. The definition of Equation (10.2-3) is general enough to allow output error and filter error as
special cases. Both output error and filter error have the special property that the dependence of $h$ on $z$
and $u$ can be cast in a recursive form, significantly lowering the computational costs. Because of this
recursive form, the total computational cost is roughly proportional to the number of time points, $N$. The
general definition of equation error also encompasses nonrecursive forms, which could have computational costs
proportional to $N^2$ or higher powers.

The equation-error methods discussed in this chapter have the property that, for each $t_i$, the dependence
of $h$ on $z(\cdot)$ and $u(\cdot)$ is restricted to one or two time points. Therefore, the computational effort for each
evaluation of $h$ is independent of $N$, and the total computational cost is roughly proportional to $N$. In
this regard, state-equation error and output-equation error are comparable to output error and filter error.

For a completely general, nonlinear system, the computational cost of state-equation error or output-equation
error is roughly similar to the cost of output error. (General nonlinear models are currently impractical for
filter error without using linearized approximations.)

In the large majority of practical applications, however, the $f$ and $g$ functions have special properties
which make the computational costs of state-equation error and output-equation error far smaller than the
computational costs of output error or filter error.

The first property is that the $f$ and $g$ functions are linear in $\xi$. This property holds true even for
systems described as nonlinear; the nonlinearity meant by the term "nonlinear system" is as a function of $x$
and $u$, not as a function of $\xi$. Equation (1.3-2) is a simple example of a static system nonlinear in the
input, but linear in the parameters. The output-error method can seldom take advantage of linearity in the
parameters, even when the system is also linear in $x$ and $u$, because the system response is usually a nonlinear
function of $\xi$. (There are some significant exceptions in special cases.)

State-equation error and output-equation error methods, in contrast, can take excellent advantage of
linearity in the parameters, even when the system is nonlinear in x and u. In this situation, state-equation
error and output-equation error meet the conditions of Section 2.5.1 for the Gauss-Newton algorithm to attain
the exact minimum in a single iteration.
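
As a minimal sketch of this single-iteration property, the following Python fragment applies one Gauss-Newton
step to a hypothetical state-equation-error problem with $\dot{x} = \xi_1 x^3 + \xi_2 u$, which is nonlinear
in x but linear in the parameters. The model, the simulated data, the noise level, and the function name are
illustrative assumptions, not taken from the text. The only point is that two very different starting values
lead to the same minimizing point after one step.

```python
import random

def gauss_newton_step(xi0, regressors, targets):
    """One Gauss-Newton step for J = 1/2 * sum (y - phi . xi)^2 with two unknowns."""
    h11 = h12 = h22 = g1 = g2 = 0.0
    for (p1, p2), y in zip(regressors, targets):
        r = p1 * xi0[0] + p2 * xi0[1] - y   # residual at the starting point
        h11 += p1 * p1                      # Gauss-Newton second gradient
        h12 += p1 * p2
        h22 += p2 * p2
        g1 += p1 * r                        # first gradient
        g2 += p2 * r
    det = h11 * h22 - h12 * h12
    # xi1 = xi0 - H^{-1} g, with the 2x2 inverse written out explicitly
    return (xi0[0] - (h22 * g1 - h12 * g2) / det,
            xi0[1] - (h11 * g2 - h12 * g1) / det)

# Hypothetical measurements of x, u, and xdot for xdot = xi1 * x**3 + xi2 * u
rng = random.Random(0)
true_xi = (-0.5, 2.0)
regressors, targets = [], []
for _ in range(200):
    x, u = rng.uniform(-2.0, 2.0), rng.uniform(-1.0, 1.0)
    regressors.append((x ** 3, u))          # nonlinear in x, linear in xi
    targets.append(true_xi[0] * x ** 3 + true_xi[1] * u + rng.gauss(0.0, 0.01))

from_zero = gauss_newton_step((0.0, 0.0), regressors, targets)
from_far = gauss_newton_step((100.0, -50.0), regressors, targets)
print(from_zero, from_far)   # the two results agree to rounding error
```

Because the cost is exactly quadratic in the parameters, a second step from either result would leave the
estimate unchanged apart from rounding.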

This is both a quantitative and a qualitative computational improvement relative to output error. The
quantitative improvement is a division of the computational cost by the number of iterations required for the
output-error method. The qualitative improvement is the elimination of the issues associated with iterative
methods: starting values, convergence-testing criteria, failure to converge, convergence accelerators,
multiple local solutions, and other issues. The most commonly cited of these benefits is that there is no need
for reasonable starting values. You can evaluate the equations at any arbitrary point (zero is often
convenient) without affecting the result.

Another simplifying property of f and g, not quite as universal, but true in the majority of cases, is
that each element of $\xi$ affects only one element of f or g. The simplest example of this is a linear system
where the unknown parameters are individual elements of the system matrices. With this structure, if we
constrain W to be diagonal, Equation (10.2-3) separates into a sum of independent minimization problems with
scalar h, one problem for each element of h. If $\ell$ is the number of elements of the h-vector, we now have
$\ell$ independent functions in the form of Equation (10.2-3), each with scalar h. Each element of $\xi$ affects
one and only one of these scalar functions.

This partitioning has the obvious benefit, common to most partitioning algorithms, that the sum of the
$\ell$ problems with scalar h requires less computation than the unpartitioned vector problem. The outer-product
computation of Equation (2.5-11), often the most time-consuming part of the algorithm, is proportional to the
square of the number of unknowns and to $\ell$. Therefore, if the unknowns are evenly distributed among the $\ell$
elements of h, the computational cost of the vector problem could be as much as $\ell^3$ times the cost of each of
the scalar problems. Other portions of the computational cost and overhead will reduce this factor somewhat,
but the improvement is still dramatic.
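
The factor can be checked with a short operation count. The count below is an illustrative sketch that assumes
the outer product dominates the cost and that the p unknowns split evenly among the $\ell$ elements of h; the
symbols p, $C_v$, and $C_s$ are introduced here only for this illustration.

```latex
\begin{aligned}
\text{vector problem:}\quad     & C_v \propto p^2\,\ell \\
\text{one scalar problem:}\quad & C_s \propto (p/\ell)^2 \cdot 1 \\
\text{ratios:}\quad             & C_v / C_s = \ell^3, \qquad C_v / (\ell\,C_s) = \ell^2
\end{aligned}
```

For example, with p = 12 and $\ell$ = 3, the count is 432 for the vector problem and 16 for each scalar problem,
or 48 for all three scalar problems together.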

Another benefit of the partitioning is that it allows us to avoid iteration when the noise covariances are
unknown. With this partitioning, the minimizing values of $\xi$ are independent of W. The normal role of W
is in weighing the importance of fitting the different elements of h. One value of $\xi$ might fit one
element of h best, while another value of $\xi$ fits another element of h best; W establishes how to strike
a compromise among these conflicting aims. Since the partitioned problem structure makes the different
elements of h independent, W is largely irrelevant. Therefore we can estimate the elements of $\xi$ using any
arbitrary value of W (usually an identity matrix). If we want an estimate of W, we can compute it after we
estimate the other unknowns.
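
The following Python sketch illustrates the independence from W for a hypothetical two-state linear system
$\dot{x} = Ax$ in which the unknowns are the individual elements of A, so that each row of A is a separate
scalar problem. The system, the noise levels, and the helper name are assumptions made for illustration. Each
row is fitted under two different diagonal weightings, and a weighting estimate is then formed afterward from
the residuals.

```python
import random

def fit_row(regressors, targets, weight):
    """Weighted least squares for one scalar element of h (two unknowns)."""
    h11 = h12 = h22 = b1 = b2 = 0.0
    for (p1, p2), y in zip(regressors, targets):
        h11 += weight * p1 * p1
        h12 += weight * p1 * p2
        h22 += weight * p2 * p2
        b1 += weight * p1 * y
        b2 += weight * p2 * y
    det = h11 * h22 - h12 * h12
    return ((h22 * b1 - h12 * b2) / det, (h11 * b2 - h12 * b1) / det)

# Hypothetical system xdot = A x; each row of A is one scalar problem
rng = random.Random(1)
a_true = ((-1.0, 2.0), (-3.0, -0.5))
noise_sd = (0.1, 1.0)                      # state-equation noise differs by row
states = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(500)]
xdot = [[a_true[j][0] * x1 + a_true[j][1] * x2 + rng.gauss(0.0, noise_sd[j])
         for (x1, x2) in states] for j in range(2)]

# The same rows fitted with W = I and with W = diag(1000, 0.001)
rows_identity = [fit_row(states, xdot[j], 1.0) for j in range(2)]
rows_other = [fit_row(states, xdot[j], w) for j, w in enumerate((1000.0, 0.001))]

# W estimated afterward as the inverse of each row's residual variance
resid_var = []
for j, (a1, a2) in enumerate(rows_identity):
    squares = [(y - a1 * x1 - a2 * x2) ** 2 for (x1, x2), y in zip(states, xdot[j])]
    resid_var.append(sum(squares) / len(squares))
w_estimate = [1.0 / v for v in resid_var]
print(rows_identity, rows_other, w_estimate)
```

The weight cancels from the normal equations of each scalar problem, which is why the two sets of row
estimates agree to rounding error.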

The combined effect of these computational improvements is to make the computational cost of the
state-equation error and output-equation error methods negligible in many applications. It is common for the
computational cost of the actual equation-error algorithm to be dwarfed by the overhead costs of obtaining the data,
plotting the results, and related computations.


10.4 DISCUSSION 

The undebated strong points of the state-equation-error and output-equation-error methods are their sim- 
plicity and low computational cost. Most important is that Gauss-Newton gives the exact minimum of the cost 
function without iteration. Because the methods are noniterative, they require no starting estimates. These 
methods have been used in many applications, sometimes under different names. 

The weaknesses of these methods stem from their assumptions of perfect state measurements. Relatively 
small amounts of noise in the measurements can cause significant bias errors in the estimates. If a measure- 
ment of some state is unavailable, or if an instrument fails, these methods are not directly applicable (though 
such problems are sometimes handled by state reconstruction algorithms).
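
The bias can be seen in a small simulation. The sketch below assumes a hypothetical scalar discrete-time
system $x_{k+1} = a x_k + b u_k$ and fits a and b by equation error, first from the exact state history and
then from the same history with added measurement noise. The system, the input, the noise level, and the
function name are illustrative assumptions rather than anything specified in the text.

```python
import random

def equation_error_fit(z, u):
    """Least-squares fit of z[k+1] = a * z[k] + b * u[k], treating z as the exact state."""
    h11 = h12 = h22 = b1 = b2 = 0.0
    for k in range(len(z) - 1):
        h11 += z[k] * z[k]
        h12 += z[k] * u[k]
        h22 += u[k] * u[k]
        b1 += z[k] * z[k + 1]
        b2 += u[k] * z[k + 1]
    det = h11 * h22 - h12 * h12
    return ((h22 * b1 - h12 * b2) / det, (h11 * b2 - h12 * b1) / det)

rng = random.Random(2)
a_true, b_true, n = 0.9, 1.0, 5000
u = [rng.choice((-1.0, 1.0)) for _ in range(n)]
x = [0.0]
for k in range(n - 1):
    x.append(a_true * x[k] + b_true * u[k])        # noise-free state history

z_noisy = [xk + rng.gauss(0.0, 0.5) for xk in x]   # measurement noise on the state

a_clean, b_clean = equation_error_fit(x, u)
a_noisy, b_noisy = equation_error_fit(z_noisy, u)
print(a_clean, a_noisy)   # a_noisy is pulled toward zero by the measurement noise
```

With exact states the fit recovers a to rounding error. With noisy states the estimate of a is biased low, and
collecting more data does not remove the bias.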

State-equation-error and output-equation-error methods can be used with either of two distinct approaches,
depending upon the application. The first approach is to accept the problem of measurement-noise sensitivity
and to emphasize the computational efficiency of the method. This approach is appropriate when computational
cost is a more important consideration than accuracy.

For example, state-equation error and output-equation error methods are popular for obtaining starting 
values for iterative procedures such as output error. In such applications, the estimates need only be accu- 
rate enough to cause the iterative methods to converge (presumably to better estimates). 

Another common use for state-equation error and output-equation error is to select a model from a large number of
candidates by estimating the parameters in each candidate model. Once the model form is selected, the rough
parameter estimates can be refined by some other method.

The second approach to using state-equation-error or output-equation-error methods is to spend the time
and effort necessary to get accurate results from them, which first requires accurate state measurements with 
low noise levels. In many applications of these methods, most of the work lies in filtering the data and 
reconstructing estimates of unmeasured states. (A Kalman filter can sometimes be helpful here, provided that 
the filter does not depend upon the parameters to be estimated. This condition requires a special problem 
structure.) The total cost of obtaining good estimates from these methods, including the cost of data pre- 
processing, may be comparable to the cost of more complicated iterative algorithms that require less 
preprocessing. The trade-off is highly dependent on application variables such as the required accuracy of 
the estimates, the quality of the available instrumentation, and the existence of independent needs for 
accurate state measurements.


CHAPTER 11


11.0 ACCURACY OF THE ESTIMATES

Parameter estimates from real systems are, by their nature, imperfect. The accuracy of the estimates is 
a pervasive issue in the various stages of application, from the problem statement to the evaluation and use
of the results. 

We introduced the subject of parameter estimation in Section 1.4, using concepts of errors in the
estimates and adequacy of the results. The subsequent chapters have largely concentrated on the derivation of
algorithms. These derivations are all related to accuracy issues, based on the definitions and discussions in
Chapter 4. However, the questions about accuracy have been largely overshadowed by the details of deriving and
implementing the algorithms.

In this chapter, we return the emphasis to the critical issue of accuracy. The final judgment of the 
parameter estimation process for a particular application is based on the accuracy of the results. We examine 
the evaluation of the accuracy, factors contributing to inaccuracy, and means of improving accuracy. A truly 
comprehensive treatment of the subject of accuracy is impossible. We restrict our discussion largely to 
generic issues related to the theory and methodology of parameter estimation. 

To make effective use of parameter estimates, we must have some gauge of their accuracy, be it a
statistical measure, an intuitive guess, or some other source. If we absolutely cannot distinguish the extremes of
accurate versus worthless estimates, we must always consider the possibility that the estimates are worthless,
in which case the estimates could not be used in any application in which their validity was important.
Therefore, measures of the estimate accuracy are as important as are the estimates themselves. Various means
of judging the accuracy of parameter estimates are in current use.

We will group the uses for measures of estimate accuracy into three general classes. The first class of 
use is in planning the parameter estimation. Predictions of the estimate accuracy can be used to evaluate the 
adequacy of the proposed experiments and instrumentation system for the parameter estimation on the proposed 
model. There are limitations to this usage because it involves predicting accuracy before the actual data are
obtained. Unexpected problems can always cause degradation of the results compared to the predictions. The 
accuracy predictions are most useful in identifying experiments that have no hope of success. 

The second use is in the parameter estimation process itself. Measures of accuracy can help detect 
various problems in the estimation, from modeling failures, data problems, program bugs, or other sources. 
Another facet of this class of use is the comparison of different estimates. The comparisons can be between 
two different models or methods applied to the same data set, between estimates from independent data sets, or 
between predictions and estimates from the experimental data. In any of these events, measures of accuracy 
can help determine which of the conflicting values is best, or whether some compromise between them should be 
considered. Comparison of the accuracy measures with the differences in the estimates is a means to determine 
if the differences are significant. The magnitude of the observed differences between the estimates is, in 
itself, an indicator of accuracy. 

The third use of measures of accuracy is for presentation with the final estimates for the user of the
results. If the estimates are to be used in a control system design, for instance, knowledge of their accuracy
is useful in evaluating the sensitivity of the control system. If the estimates are to be used by an explicit
adaptive or learning control system, then it is important that the accuracy evaluation be systematic enough to
be automatically implemented. Such immediate use of the estimates precludes the intercession of engineering
judgment; the evaluation of the estimates must be entirely automatic. Such control systems must recognize poor
results and suitably discount them (or ensure that they never occur, which is an overly optimistic goal).

The single most critical contributor to getting accurate parameter estimates in practical problems is the 
analyst's understanding of the physical system and the instrumentation. The most thorough knowledge of param- 
eter estimation theory and the use of the most powerful techniques do not compensate for poor understanding of 
the system. This statement relates directly to the discussion in Chapter 1 about the "black box" identification
problem and the roles of independent knowledge versus system identification. The principles discussed in
this chapter, although no substitute for an understanding of the system, are a necessary adjunct to such 
understanding. 

Before proceeding further, we need to review the definition of the term "accuracy" as it applies to real
data. A system is never described exactly by the simplified models used for analysis. Regardless of the
sophistication of the model, unexplained sources of modeling error will always remain. There is no unique,
correct model.

The concept of accuracy is difficult to define precisely if no correct model exists. It is easiest to
approach by considering the problem in two parts: estimation and modeling. For analyzing the estimation
problem, we assume that the model describes the system exactly. The definition of accuracy is then precise and
quantitative. Many results are available in the subject area of estimation accuracy. Sections 11.1 and 11.2
discuss several of them.

The modeling problem addresses the question of whether the form of the model can describe the system
adequately for its intended use. There is little guide from the theory in this area. Studies such as those
of Gupta, Hall, and Trankle (1978), Fiske and Price (1977), and Akaike (1974), discuss selection of the best
model from a set of candidates, but do not consider the more basic issue of defining the candidate models.
Section 11.4 considers this point in more detail.

For the most part, the determination of model adequacy is based on engineering judgment and
problem-specific analysis relying heavily on the analyst's understanding of the physics of the system. In some cases,
we can test model adequacy by demonstration: if we try the model and it achieves its purpose, it was obviously
adequate. Such tests are not always practical, however. This method assumes, of course, that the test was
comprehensive. Such assumptions should not be made lightly; they have cost lives when systems encountered
untested conditions.

After considering estimation and modeling as separate problems, we need to look at their interactions to
complete the discussion of accuracy. We need to consider the estimates that result from a model judged to be
adequate, although not exact. As in the modeling problem, this process involves considerable subjective
judgment, although we can obtain some quantitative results.

We can examine some specific, postulated sources of modeling error through simulations or analyses that
use more complex models than are practical or desirable in the parameter estimation. Such simulations or
analyses can include, for example, models of specific, postulated instrumentation errors (Hodge and Bryant,
1978; and Sorensen, 1972). Maine and Iliff (1981b) present some more general, but less rigorous, results.


11.1 CONFIDENCE REGIONS 

The concept of a confidence region is central to the analytical study of estimation accuracy. In general
terms, a confidence region is a region within which we can be reasonably confident that the true value of $\xi$
lies. Accurate estimates correspond to small confidence regions for a given level of confidence. Note that
small confidence regions imply large confidence; in order to avoid this apparent inversion of terminology, the
term "uncertainty region" is sometimes used in place of the term "confidence region." The following
subsections define confidence regions more precisely.

For continuous, nonsingular estimation problems, the probability of any point estimate's being exactly
correct is zero. We need a concept such as the confidence region to make statements with a nonzero confidence.
Throughout the discussion of confidence regions, we assume that the system model is correct; that is, we assume
that $\xi$ has a true value lying in the parameter space. In later sections we will consider issues relating to
modeling error.

11.1.1 Random Parameter Vector 


Let us consider first the case in which $\xi$ is a random variable with a known prior distribution. This
situation usually implies the use of an MAP estimator.

In this case, $\xi$ has a posterior distribution, and we can define the posterior probability that $\xi$ lies
in any fixed region. Although we will use the posterior distribution of $\xi$ as the context for this
discussion, we can equally well define prior confidence regions. None of the following development depends upon our
working with a posterior distribution. For simplicity of exposition, we will assume that the posterior
distribution of $\xi$ has a density function. The posterior probability that $\xi$ lies in a region R is then

$$P(R) = \int_R p(\xi|Z)\, d|\xi|$$ (11.1-1)

We define R to be a confidence region for the confidence level $\alpha$ if $P(R) = \alpha$, and no other region
with the same probability is smaller than R. We use the volume of a region as a measure of its size.

Theorem 11.1 Let R be the set of all points with $p(\xi|Z) \ge c$, where c is
a constant. Then R is a confidence region for the confidence level
$\alpha = P(R)$.

Proof Let R be as defined above, and let R' be any other region with
$P(R') = \alpha$. We need to prove that the volume of R' must be greater than or
equal to that of R. We define $T = R \cap R'$, $S = R \cap \overline{R'}$, and
$S' = R' \cap \overline{R}$, where the overbar denotes the complement of a region.
Then T, S, and S' are disjoint, $R = T \cup S$, and $R' = T \cup S'$. Because
$S \subset R$, we must have $p(\xi|Z) \ge c$ everywhere in S. Conversely,
$S' \subset \overline{R}$, so $p(\xi|Z) < c$ everywhere in S'. In order for
$P(R') = P(R)$, we must have $P(S') = P(S)$. Because the density is at least c
on S and less than c on S', equal probabilities require that the volume of S'
be greater than or equal to that of S. The volume of R' must then be greater
than or equal to that of R, completing the proof.
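
As a numerical illustration of Theorem 11.1, the following Python sketch builds the region of the theorem on a
grid for a hypothetical skewed scalar density, $p(x) = x e^{-x}$ for $x \ge 0$, and compares its volume (here,
its length) with that of an equal-tail interval of approximately the same probability. The density, the grid,
and the confidence level of 0.9 are assumptions chosen only for illustration.

```python
import math

def density(x):
    """Hypothetical skewed posterior density: x * exp(-x) on x >= 0."""
    return x * math.exp(-x)

alpha = 0.9
n_cells, upper = 20000, 20.0
dx = upper / n_cells
centers = [(i + 0.5) * dx for i in range(n_cells)]
values = [density(x) for x in centers]

# Theorem 11.1 region: take cells in order of decreasing density until P(R) reaches alpha
order = sorted(range(n_cells), key=lambda i: values[i], reverse=True)
mass, used = 0.0, 0
while mass < alpha:
    mass += values[order[used]] * dx
    used += 1
threshold_c = values[order[used - 1]]      # plays the role of the constant c
hdr_volume = used * dx

# A competing region with about the same probability: the equal-tail interval
cumulative, lower_q, upper_q = 0.0, None, None
for x, p in zip(centers, values):
    cumulative += p * dx
    if lower_q is None and cumulative >= (1.0 - alpha) / 2.0:
        lower_q = x
    if upper_q is None and cumulative >= (1.0 + alpha) / 2.0:
        upper_q = x
equal_tail_volume = upper_q - lower_q

print(threshold_c, hdr_volume, equal_tail_volume)
```

For this density the region defined by the density threshold is noticeably shorter than the equal-tail
interval, as the theorem requires.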

It is often convenient to characterize a closed region by its boundary. The boundaries of the confidence
regions defined by Theorem 11.1 are isoclines of the posterior density function $p(\xi|Z)$.

We can write the confidence region derived in the above theorem as

$$R = \{x : p_{\xi|Z}(x|Z) \ge c\}$$ (11.1-2)

We must use the full notation for the probability density function to avoid confusion in the following manipu- 
lations. For consistency with the following section, it is convenient to re-express the confidence region in 
terms of the density function of the error. 


$$e = \xi - \hat{\xi}$$ (11.1-3)

The estimate $\hat{\xi}$ is a deterministic function of Z; therefore, Equation (11.1-3) trivially gives

$$p_{\xi|Z}(x|Z) = p_{e|Z}(x - \hat{\xi}|Z)$$ (11.1-4)

Substituting this into Equation (11.1-2) gives the expression

$$R = \{x : p_{e|Z}(x - \hat{\xi}|Z) \ge c\}$$ (11.1-5)

Substituting $x + \hat{\xi}$ for x in Equation (11.1-5) gives the convenient form

$$R = \{\hat{\xi} + x : p_{e|Z}(x|Z) \ge c\}$$ (11.1-6)

This form shows the boundaries of the confidence regions to be translated isoclines of the error-density
function.

Exact determination of the confidence regions is impractical except in simple cases. One such case occurs
when $\xi$ is scalar and $p(\xi|Z)$ is unimodal. An isocline then consists of two points, and the line segment
between the two points is the confidence region. In this one-dimensional case, the confidence region is often
called a confidence interval.

Another simple case occurs when the posterior density function is in some standard family of density 
functions expressible in closed form. This is most commonly the family of Gaussian density functions. An
isocline of a Gaussian density function with mean m and nonsingular covariance A is a set of x values
satisfying

$$(x - m)^* A^{-1} (x - m) = c$$ (11.1-7)


This is the equation of an ellipsoid. 
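
For a two-dimensional case, the ellipsoid can be described by its semi-axes, which lie along the eigenvectors
of A and have lengths equal to the square roots of c times the eigenvalues. The Python sketch below computes
them in closed form for a hypothetical 2-by-2 covariance and confirms that the tip of each semi-axis satisfies
the isocline equation; the numerical values and function names are illustrative assumptions.

```python
import math

def ellipse_axes(cov, c):
    """Semi-axis lengths and unit directions of (x - m)* A^{-1} (x - m) = c, symmetric 2x2 A."""
    a, b, d = cov[0][0], cov[0][1], cov[1][1]
    if b == 0.0:
        pairs = [(a, (1.0, 0.0)), (d, (0.0, 1.0))]     # already aligned with the axes
    else:
        mean_diag = 0.5 * (a + d)
        radius = math.hypot(0.5 * (a - d), b)
        pairs = []
        for lam in (mean_diag + radius, mean_diag - radius):   # eigenvalues of A
            vx, vy = b, lam - a                                 # matching eigenvector
            norm = math.hypot(vx, vy)
            pairs.append((lam, (vx / norm, vy / norm)))
    return [(math.sqrt(c * lam), v) for lam, v in pairs]

cov = ((4.0, 1.5), (1.5, 1.0))     # hypothetical covariance A
mean = (1.0, -2.0)                 # hypothetical mean m
c = 1.0
axes = ellipse_axes(cov, c)

def quad_form(x):
    """(x - m)* A^{-1} (x - m), using the explicit 2x2 inverse of A."""
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    det = cov[0][0] * cov[1][1] - cov[0][1] ** 2
    return (cov[1][1] * dx * dx - 2.0 * cov[0][1] * dx * dy + cov[0][0] * dy * dy) / det

# The tip of each semi-axis lies on the isocline
tips = [(mean[0] + r * v[0], mean[1] + r * v[1]) for r, v in axes]
print(axes, [quad_form(t) for t in tips])
```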

For problems not fitting into one of these special cases, we usually must make approximations in the com- 
putation of the confidence regions. Section 11.1.3 discusses the most common approximation. 

11.1.2 Nonrandom Parameter Vector 


When $\xi$ is simply an unknown parameter with no random nature, the development of confidence regions is
more oblique, but the result is similar in form to the results of the previous section. The same comments
apply when we wish to ignore any prior distribution of $\xi$ and to obtain confidence regions based solely on
the current experimental data. These situations usually imply the use of MLE estimators.

In neither of these situations can we meaningfully discuss the probability of $\xi$ lying in a given region.
We proceed as follows to develop a substitute concept: the estimate $\hat{\xi}$ is a function of the observation Z,
which has a probability distribution conditioned on $\xi$. Therefore, we can define a probability distribution
of $\hat{\xi}$ conditioned on $\xi$. We will assume that this distribution has a density function $p_{\hat{\xi}|\xi}$.

For a given value of $\xi$, the isoclines of $p_{\hat{\xi}|\xi}$ define boundaries of confidence regions for
$\hat{\xi}$. Let $R_1$ be such a confidence region, with confidence level $\alpha$.

$$R_1 = \{x : p_{\hat{\xi}|\xi}(x|\xi) \ge c\}$$ (11.1-8)

It is convenient to define $R_1$ in terms of the error density function $p_{e|\xi}$, using the relation

$$p_{\hat{\xi}|\xi}(x|\xi) = p_{e|\xi}(\xi - x|\xi)$$ (11.1-9)

This gives

$$R_1 = \{\xi - x : p_{e|\xi}(x|\xi) \ge c\}$$ (11.1-10)

The estimate $\hat{\xi}$ has probability $\alpha$ of being in $R_1$. For this chapter, we are more interested in the
situation where we know the value of $\hat{\xi}$ and seek to define a confidence region for $\xi$, which is unknown. We
can define such a confidence region for $\xi$, given $\hat{\xi}$, in two steps, starting with the region $R_1$.

The first step is to define a region $R_2$ which is a mirror image of $R_1$. A point $\xi - x$ in the region
$R_1$ reflects onto the point $\hat{\xi} + x$ in $R_2$, as shown in Figure (11.1-1). We can thus write $R_2$ as

$$R_2 = \{\hat{\xi} + x : p_{e|\xi}(x|\xi) \ge c\}$$ (11.1-11)

This reflection interchanges $\xi$ and $\hat{\xi}$; therefore, $\xi$ is in $R_2$ if and only if $\hat{\xi}$ is in
$R_1$. Because there is probability $\alpha$ that $\hat{\xi}$ lies in $R_1$, there is the same probability $\alpha$
that $\xi$ lies in $R_2$.


To be technically correct, we must be careful about the phrasing of this statement. Because the true value
$\xi$ is not random, it makes no sense to say that $\xi$ has probability $\alpha$ of lying in $R_2$. The randomness is in
the construction of the region $R_2$ because $R_2$ depends on the estimate $\hat{\xi}$, which depends in turn on the
noise-contaminated observations. We can sensibly say that the region $R_2$, constructed in this manner, has probability
$\alpha$ of covering the true value $\xi$. This concept of a region covering the fixed point $\xi$ replaces the concept of
the point $\xi$ lying in a fixed region. The distinction is more important in theory than in practice.
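
The coverage interpretation lends itself to a quick Monte Carlo check. The Python sketch below assumes a
hypothetical scalar problem in which the error is Gaussian with a known standard deviation that does not depend
on the true value, so the interval is an exact isocline region. The true value stays fixed while the interval,
centered on each new estimate, moves from trial to trial. All numerical values are illustrative assumptions.

```python
import random

rng = random.Random(3)
xi_true = 2.0                  # fixed, nonrandom true value (hypothetical)
sigma = 0.3                    # known standard deviation of the error
half_width = 1.6449 * sigma    # Gaussian isocline giving confidence level 0.90

trials = 20000
covered = 0
for _ in range(trials):
    xi_hat = xi_true + rng.gauss(0.0, sigma)   # estimate from one noisy experiment
    # The region is centered on the estimate, so the region is what varies between trials
    if xi_hat - half_width <= xi_true <= xi_hat + half_width:
        covered += 1
coverage = covered / trials
print(coverage)   # fraction of constructed regions that cover the true value
```

The fraction of regions that cover the fixed true value comes out close to the nominal confidence level.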


Although we have defined the region $R_2$ in principle, we cannot construct the region from the data
available because $R_2$ depends on the value of $\xi$, which is unknown. Our next step is to construct a region $R_3$,
which approximates $R_2$, but does not depend on the true value of $\xi$. We base the approximation on the
assumption that $p_{e|\xi}$ is approximately invariant as a function of $\xi$; that is

$$p_{e|\xi}(x|\xi) \approx p_{e|\xi}(x|\xi + \delta)$$ (11.1-12)

This approximation is unlikely to be valid for large values of $\delta$ except in simple cases. For small values
of $\delta$, the approximation is usually reasonable.

We define the confidence region $R_3$ by applying this approximation to Equation (11.1-11), using $\hat{\xi} - \xi$
for $\delta$.

$$R_3 = \{\hat{\xi} + x : p_{e|\xi}(x|\hat{\xi}) \ge c\}$$ (11.1-13)

The region $R_3$ depends only on $\hat{\xi}$, $p_{e|\xi}$, and the arbitrary constant c. The function $p_{e|\xi}$ is presumed known
from the start, and $\hat{\xi}$ is the estimate computed by the methods described in previous chapters. In principle,
we have sufficient information to compute the region $R_3$. Practical application requires either that $p_{e|\xi}$
be in one of the simple forms described in Section 11.1.1, or that we make further approximations as discussed
in Section 11.1.3.

If $\hat{\xi} - \xi$ is small (that is, if the estimate is accurate), then $R_3$ will likely be a close approximation
to $R_2$. If $\hat{\xi} - \xi$ is large, then the approximation is questionable. The result is that we are unable to
define large confidence regions accurately except in special cases. We can tell that the confidence region is
large, but its precise size and shape are difficult to determine.

Note that the confidence region for nonrandom parameters, defined by Equation (11.1-13), is almost iden- 
tical in form to the confidence region for random parameters, defined by Equation (11.1-6). The only differ- 
ence in the form is what the density functions are conditioned on. 

11.1.3 Gaussian Approximation

The previous sections have derived the boundaries of confidence regions for both random and nonrandom
parameter vectors in terms of isoclines of probability density functions of the error vector. Except in
special cases, the probability density functions are too complicated to allow practical computation of the
exact isoclines. Extreme precision in the computation of the confidence regions is seldom necessary; we have
already made approximations in the definition of confidence regions for nonrandom parameters. In this section,
we introduce approximations which allow relatively easy computation of confidence regions.

The central idea of this section is to approximate the pertinent probability density functions by Gaussian
density functions. As discussed in Section 11.1.1, the isoclines of Gaussian density functions are ellipsoids,
which are easy to compute. We call these "confidence ellipsoids" or "uncertainty ellipsoids." In many cases,
we can justify the Gaussian approximation with arguments that the distributions asymptotically approach
Gaussians as the amount of data increases. Section 5.4.2 discusses some pertinent asymptotic results.

A Gaussian approximation is defined by its mean and covariance. We will consider appropriate choices for
the mean and covariance to make the Gaussian density function a reasonable approximation. An obvious
possibility is to set the mean and covariance of the Gaussian approximation to match the mean and covariance of the
original density function; we are often forced to settle for approximations to the mean and covariance of the
original density function, the exact values being impractical to compute. Another possibility is to use
Equations (3.5-17) and (3.5-18). We will illustrate the use of both of these options.

Consider first the case of an MLE estimator. Equation (11.1-13) defines the confidence region. We will
use covariance matching to define the Gaussian approximation to $p_{e|\xi}$. The exact mean and covariance of
$p_{e|\xi}$ are difficult to compute, but there are asymptotic results which give reasonable approximations.

We use zero as an approximation to the mean of $p_{e|\xi}$; this approximation is based on MLE estimators being
asymptotically unbiased. Because MLE estimators are asymptotically efficient, the Cramer-Rao bound gives an asymptotic
approximation for the covariance of $p_{e|\xi}$ as the inverse of the Fisher information matrix $M(\xi)$. We can use
either Equation (4.2-19) or (4.2-24) as equivalent expressions for the Fisher information matrix.
Equation (5.4-11) gives the particular form of $M(\xi)$ for static nonlinear systems with additive Gaussian noise.

Both $\hat{\xi}$ and $M(\hat{\xi})$ are readily available in practical application. The estimate $\hat{\xi}$ is the primary output
of a parameter estimation program, and most MLE parameter-estimation programs compute $M(\hat{\xi})$ or an approximation
to it as a by-product of iterative minimization of the cost function.

Now consider the case of an MAP estimator. We need a Gaussian approximation to $p(e|Z)$.
Equations (3.5-17) and (3.5-18) provide a convenient basis for such an approximation. By Equation (3.5-17), we set
the mean of the Gaussian approximation equal to the point at which $p(e|Z)$ is a maximum; by definition of the
MAP estimator, this point is zero.

We then set the covariance of the Gaussian approximation to

$$A = [-\nabla_\xi^2 \ln p(e|Z)]^{-1}$$ (11.1-14)

evaluated at $\xi = \hat{\xi}$. For static nonlinear systems with additive Gaussian noise, Equation (11.1-14) reduces to
the form of Equation (5.4-12), which we could also have obtained by approximate covariance matching arguments.
This form for the covariance is the same as that used in the MLE confidence ellipsoid, with the addition of the
prior covariance term. As the prior covariance goes to infinity, the confidence ellipsoid for the MAP estimator
approaches that for the MLE estimator, as we would anticipate.
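
The limiting behavior can be illustrated with a scalar example. The Python sketch below assumes a hypothetical
static model $z_i = \xi t_i + \text{noise}$ with Gaussian noise, for which the Fisher information is
$M = \sum t_i^2 / \sigma^2$, and it assumes the usual form of the MAP covariance, $(M + P^{-1})^{-1}$, where P is
the prior variance. Equations (5.4-11) and (5.4-12) are not reproduced in this section, so these formulas, like
the sample times and noise level, should be read as assumptions made for illustration.

```python
def fisher_information(times, noise_sd):
    """Fisher information for the scalar model z_i = xi * t_i + noise (assumed form)."""
    return sum(t * t for t in times) / noise_sd ** 2

times = [0.1 * k for k in range(1, 51)]   # hypothetical sample times
noise_sd = 0.2                            # hypothetical measurement-noise standard deviation
m = fisher_information(times, noise_sd)

a_mle = 1.0 / m                           # Cramer-Rao approximation for the MLE
prior_variances = (1e-4, 1e-2, 1.0, 1e2, 1e6)
a_map = [1.0 / (m + 1.0 / p) for p in prior_variances]   # assumed MAP form

for p, a in zip(prior_variances, a_map):
    print(p, a, a_mle)
```

The MAP variance is always smaller than the MLE variance, and it increases toward the MLE value as the prior
variance grows.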

Both the MLE and MAP confidence ellipsoids take the form

$$(x - \hat{\xi})^* A^{-1} (x - \hat{\xi}) = c$$ (11.1-15)

where A is an approximation to the error-covariance matrix. We have suggested suitable approximations in the
above paragraphs, but most approximations to the error covariance are equally acceptable. The choice is
usually dictated by what is conveniently available in a given program.

11.1.4 Nonstatistical Derivation 

We can alternately derive the confidence ellipsoids for MAP and MLE estimators from a nonstatistical
viewpoint. This derivation obtains the same result as the statistical approach and is easier to follow.
Comparison of the ideas used in the statistical and nonstatistical derivations reveals the close relationships between
the statistical characteristics of the estimates and the numerical problems of computing them. The
nonstatistical approach generalizes easily to estimators and models for which precise statistical descriptions are
difficult.

The nonstatistical derivation presumes that the estimate is defined as the minimizing point of some cost
function. We examine the shape of this cost function as it affects the numerical minimization problem in the
area of the minimum. For current purposes, we are not concerned with start-up problems, isolated local minima,
and other problems manifested far from the solution point. A relatively flat, ill-defined minimum corresponds
to a questionable estimate; the extreme case of this is a function without a discrete local minimum point. A
steep, well-defined minimum corresponds to a reliable estimate.

With this justification, we define a confidence region to be the set of points with cost-function values 
less than or equal to some constant. Different values of the constant give different confidence levels. The 
boundary of such a region is an isocline of the cost function. 

We then approximate the cost function in the neighborhood of the minimum by a quadratic Taylor-series
expansion about the minimum point.

$$J(\xi) \approx J(\hat{\xi}) + \tfrac{1}{2} (\xi - \hat{\xi})^* [\nabla_\xi^2 J(\hat{\xi})] (\xi - \hat{\xi})$$ (11.1-16)

The isoclines of this quadratic approximation are the confidence ellipsoids.

$$(\xi - \hat{\xi})^* [\nabla_\xi^2 J(\hat{\xi})] (\xi - \hat{\xi}) = c$$ (11.1-17)

The inverse of the second gradient of an MLE or MAP cost function is an asymptotic approximation to the
appropriate error covariance. Therefore, Equation (11.1-17) gives the same shape confidence ellipsoids as we
previously derived on a statistical basis. In practice, the Gauss-Newton or other approximation to the second
gradient is usually used.

The constant c determines the size of the confidence ellipsoid. The nonstatistical derivation gives no
obvious basis for selecting a value of c. The value c = 1 gives the most useful correspondence to the
statistical derivation, as we will see in Section 11.2.1.

Figures (11.1-2) and (11.1-3) illustrate the construction of one-dimensional confidence ellipsoids using 
the nonstatistical definition. 
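
A one-dimensional construction of this kind can be sketched numerically. The Python fragment below assumes a
hypothetical scalar cost $J(\xi) = \frac{1}{2\sigma^2} \sum (z_i - \xi t_i)^2$, estimates its second gradient at
the minimum by central differences, and forms the c = 1 interval of Equation (11.1-17). The data, the noise
level, and the step size are illustrative assumptions; the noise is a deterministic alternating sequence so
that the example is reproducible.

```python
def cost(xi, times, data, noise_sd):
    """Hypothetical cost J(xi) = 1/(2 sd^2) * sum (z_i - xi * t_i)^2."""
    return sum((z - xi * t) ** 2 for t, z in zip(times, data)) / (2.0 * noise_sd ** 2)

times = [0.1 * k for k in range(1, 51)]
noise_sd = 0.2
# Deterministic alternating pseudo-noise, used only to keep the example reproducible
data = [1.5 * t + noise_sd * (-1) ** k for k, t in enumerate(times)]

# Minimizing point (closed form for this quadratic cost)
xi_hat = sum(t * z for t, z in zip(times, data)) / sum(t * t for t in times)

# Second gradient by central differences, then the c = 1 isocline of Equation (11.1-17)
step = 1e-3
second_gradient = (cost(xi_hat + step, times, data, noise_sd)
                   - 2.0 * cost(xi_hat, times, data, noise_sd)
                   + cost(xi_hat - step, times, data, noise_sd)) / step ** 2
half_width = (1.0 / second_gradient) ** 0.5
interval = (xi_hat - half_width, xi_hat + half_width)
print(xi_hat, second_gradient, interval)
```

For this cost the second gradient equals $\sum t_i^2 / \sigma^2$, so the half-width of the c = 1 interval is the
reciprocal of its square root. A flatter cost function would give a smaller second gradient and a wider
interval.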


11.2 ANALYSIS OF THE CONFIDENCE ELLIPSOID 

The confidence ellipsoid gives a comprehensive picture of the theoretically likely errors in the estimate. 
It is difficult, however, to display the information content of the ellipsoid on a two-dimensional sheet of 
paper. In the applications we most commonly work on, there are typically 10 to 30 unknown parameters; that is,
the ellipsoid is 10- to 30-dimensional. We can print the covariance matrix which defines the shape of the 
ellipsoid, but it is difficult to draw useful conclusions from such a presentation format. The problem of 
meaningful presentation is further compounded when analyzing hundreds of experiments to obtain parameter 
estimates under a wide variety of conditions. 

In the following sections, we discuss simplified statistics that characterize important features of the
confidence ellipsoids in ways that are easy to describe and present. The emphasis in these statistics is on
reducing the dimensionality of the problem. Many important questions about accuracy reduce to one-dimensional
forms, such as the accuracy of the estimate of each element of the parameter vector.

All of the statistics discussed here are functions of the matrix A, which defines the shape of the
confidence ellipsoid. We have seen above that A is an approximation to the error-covariance matrix. These two
viewpoints of A will provide us with geometrical and statistical interpretations. A third interpretation
comes from viewing A as the inverse of the second gradient of the cost function. In practice, A is usually
computed from the Gauss-Newton or other convenient approximation to the second gradient.

These statistics are closely linked to some of the basic sources of estimation errors and difficulties. 
We will illustrate the discussion with idealized examples of these classes of difficulties. The exact means 
of overcoming such difficulties depends on the problem, but the first step is to understand the mechanism 
causing the difficulty. In a surprising number of applications, the major difficulties are cases of the simple 
idealizations discussed here. 

11.2.1 Sensitivity 

The sensitivity is the simplest of the statistics relating to the confidence ellipsoid. Although the 
sensitivity has both a statistical and a nonstatistical interpretation, the use of the statistical interpretation 
is relatively rare. The term "sensitivity" comes from the nonstatistical interpretation, which we will discuss 
first. 

From the nonstatistical viewpoint, the sensitivity is a measure of how much the cost-function value 
changes for a given change in a scalar parameter value. The most common definition of the sensitivity with 
respect to a parameter is the second partial derivative of the cost function with respect to the parameter. 

$$S_i = \frac{\partial^2 J(\xi)}{\partial \xi_i^2} \tag{11.2-1}$$

For the purposes of this chapter, we are interested in the sensitivity evaluated at the minimum point of the 
cost function; we will take this as part of the definition of the sensitivity. 

The $\xi_i$ in Equation (11.2-1) can be any scalar function of the $\xi$ vector. In most cases, $\xi_i$ is one of 
the elements of the $\xi$ vector. For simplicity, we will assume for the rest of this section that $\xi_i$ is the 
ith element of $\xi$. Generalizations are straightforward. When $\xi_i$ is the ith element of $\xi$, the second 
partial derivative with respect to $\xi_i$ is the ith diagonal element of the second-gradient matrix. 

$$S_i = [\nabla_\xi^2 J(\xi)]_{ii} \tag{11.2-2}$$

The sensitivity has a simple geometric interpretation based on the confidence ellipsoid. Use the value 
c = 1 in Equation (11.1-17) to define a confidence ellipsoid. Draw a line passing through $\hat{\xi}$ (the center of 
the ellipsoid) and parallel to the $\xi_i$ axis. The sensitivity with respect to $\xi_i$ is related to the distance, 
$I_i$, from the center of the ellipsoid to the intercept of this line and the ellipsoid. We call this distance 
the insensitivity with respect to $\xi_i$. Figure (11.2-1) shows the construction of the insensitivities with 
respect to $\xi_1$ and $\xi_2$ on a two-dimensional example. The relationship between the sensitivity and the 
insensitivity is 

$$I_i = (S_i)^{-1/2} \tag{11.2-3}$$

which follows immediately from Equation (11.1-17) for the confidence ellipsoid, and Equation (11.2-1) for the 
sensitivity. 

We can rephrase the geometric interpretation of the insensitivity as follows: the insensitivity with 
respect to $\xi_i$ is the largest change that we can make in the ith element of $\xi$ and still remain within the 
confidence ellipsoid. All other elements of $\xi$ are constrained to remain equal to their estimated values 
during this search; that is, the search is constrained to a line parallel to the $\xi_i$ axis passing through $\hat{\xi}$. 

From the statistical viewpoint, the insensitivity with respect to $\xi_i$ is an approximation to the standard 
deviation of $e_i$, the corresponding component of the error, conditioned on all of the other components of the 
error. We can see this by recalling the results from Chapter 3 on conditional Gaussian distributions. If the 
covariance of e is A, then the covariance of $e_i$ conditioned on all of the other components is 
$[(A^{-1})_{ii}]^{-1}$; therefore, the conditional standard deviation is $[(A^{-1})_{ii}]^{-1/2}$. From Equations (11.2-2) 
and (11.2-3), we can see that this expression equals the insensitivity. Note that the conditioning on the 
other elements in the statistical viewpoint corresponds directly to the constraint on the other elements in the 
geometric viewpoint. 
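
As a minimal numerical sketch of these relationships, the following Python fragment computes the sensitivities and insensitivities for a two-parameter problem and checks that the insensitivity agrees with the conditional standard deviation obtained from the usual partitioned-Gaussian formula. The second-gradient matrix is hypothetical and chosen only for illustration; the fragment assumes, as stated above, that A is the inverse of the second gradient.

```python
import numpy as np

# Hypothetical second-gradient matrix of the cost function at the minimum
# (symmetric positive definite); the numbers are illustrative only.
second_gradient = np.array([[4.0, 1.8],
                            [1.8, 1.0]])

# A is the inverse of the second gradient (the approximate error covariance).
A = np.linalg.inv(second_gradient)

# Sensitivity, Equation (11.2-2): diagonal elements of the second gradient.
sensitivity = np.diag(second_gradient)

# Insensitivity, Equation (11.2-3): I_i = S_i ** (-1/2).
insensitivity = sensitivity ** -0.5

# Conditional standard deviation of the first error component given the
# second, from the partitioned-Gaussian formula
# var(e_1 | e_2) = A_11 - A_12 * A_21 / A_22.
cond_var_first = A[0, 0] - A[0, 1] * A[1, 0] / A[1, 1]
cond_std_first = np.sqrt(cond_var_first)

print("insensitivities:", insensitivity)
print("conditional standard deviation of first component:", cond_std_first)
```

The two printed values for the first parameter should agree, which is the statement that the geometric constraint and the statistical conditioning describe the same quantity.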

A sensitivity analysis will detect one of the most obvious kinds of estimation difficulty: parameters 
which have little or no effect on the system response. If a parameter has no effect on the system response, 
then it should be obvious that the system response data give no basis for an estimate of the parameter; in 
statistical terms, the system is unidentifiable. Similarly, if a parameter has little effect on the system 
response, then there is little basis for an estimate of the parameter; we can expect the estimates to be 
inaccurate. 

Checking for parameters which have no effect on the system response may seem like an academic exercise, 
considering that practical problems would not be likely to have such irrelevant parameters. In fact, this 
seemingly trivial difficulty is extremely common in practical applications. It can arise from typographical 
or other errors in input to computer programs. Perhaps the most common example of this problem is attempting 
to estimate the effect of an input which is identically zero. The input might either be validly zero, in which 
case its effect cannot be estimated, or the input signal might have been destroyed or misplaced by sensor or 
programming problems. 

The sensitivity is a reasonable indicator of accuracy only when we are estimating a single parameter, 
because the estimates of other parameters are never exact, as the sensitivity analysis assumes. The sensitivity 
analysis ignores all effects of correlation between parameters; we can evaluate the sensitivity with 
respect to a parameter without even knowing what other parameters are being estimated. When more than one 
parameter is estimated, the sensitivity analysis gives only a lower bound for the error estimate. The error band is 
always at least as large as the insensitivity regardless of what other parameters are estimated; correlation 
effects between parameters can increase, but never decrease, the error band. In other words, high sensitivity 
is a necessary, but not sufficient, condition for an accurate estimate. 

In practice, correlation effects tend to increase the error band so much that the sensitivity is virtually 
useless as an indicator of accuracy. The sensitivity analysis is usually useful only for detecting the problem 
of completely irrelevant parameters. The sensitivity will not indicate when the effect of a parameter is 
indistinguishable from the effects of other parameters, a more common problem. 

11.2.2 Correlation 

We noted in the previous section that correlations among parameters result in much larger error bands than 
indicated by the sensitivities alone. The inadequacy of the sensitivity as a measure of estimate accuracy has 
led to the widespread use of the statistical correlations to indicate accuracy. We will see in this section 
that the correlations also give an incomplete picture of the accuracy. 

The statistical correlation between two error components $e_i$ and $e_j$ is defined to be 

$$\mathrm{corr}(e_i, e_j) = \frac{E\{e_i e_j\}}{\sqrt{E\{e_i^2\}\,E\{e_j^2\}}} \tag{11.2-4}$$

assuming that the means of $e_i$ and $e_j$ are zero. In terms of A, the covariance matrix of e, the correlation 
is 

$$\mathrm{corr}(e_i, e_j) = \frac{A_{ij}}{\sqrt{A_{ii} A_{jj}}} \tag{11.2-5}$$

Geometrically, the correlations are related to the eccentricity of the confidence ellipsoid. If the 
sensitivities with respect to all of the unknown parameters are equal (which we can always arrange by a scale 
change), and if the correlations are all zero, then the confidence ellipsoid is spherical. As the magnitudes 
of the correlations become larger, the eccentricity of the scaled ellipsoid increases. The magnitude of the 
correlations can never exceed 1, except through approximations or round-off errors in the computation. 

The definition above is for the unconditional, or full, correlations. Whenever the term correlation 
appears without a modifier, it implicitly means the unconditional correlation. We can also define conditional 
correlations, although they are less commonly used. The definition of the conditional correlation is identical 
to that of the unconditional correlation, except that the expected values are all conditioned on all of the 
parameters other than the two under consideration. We can express the conditional correlation of $e_i$ and $e_j$ 
as 

$$\mathrm{cond\ corr}(e_i, e_j) = -\frac{\Gamma_{ij}}{\sqrt{\Gamma_{ii}\Gamma_{jj}}} \tag{11.2-6}$$

where $\Gamma = A^{-1}$. This is similar to the expression for the unconditional correlation, the difference being that 
$\Gamma$ replaces A and the sign is changed. 

If there are only two unknowns, the conditional and unconditional correlations are identical. If there 
are more than two unknowns, the conditional and unconditional correlations can give quite different pictures. 
Consider the case in which $\Gamma$ is an N-by-N matrix with 1's on the diagonal and with all of the off-diagonal 
elements equal to X. By Equation (11.2-6), each conditional correlation then equals -X. As X approaches 
-1/(N - 1), the full correlation approaches 1. In the limit, when X equals -1/(N - 1), the $\Gamma$ matrix is 
singular. Thus, for large N, the full correlations can be quite high even when all of the conditional 
correlations are low. This same example inverts to show that the converse also is true. 
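
The following Python sketch evaluates this example numerically. It is an illustration only: the function names are ours, $\Gamma$ denotes the inverse of A as in Equation (11.2-6), and the choices N = 20 and an off-diagonal value at 99.9 percent of the singular limit are arbitrary assumptions.

```python
import numpy as np

def full_correlations(A):
    """Unconditional correlations, Equation (11.2-5)."""
    d = np.sqrt(np.diag(A))
    return A / np.outer(d, d)

def conditional_correlations(A):
    """Conditional correlations, Equation (11.2-6), with Gamma = inv(A).

    The diagonal of the returned matrix is -1 and carries no meaning.
    """
    gamma = np.linalg.inv(A)
    d = np.sqrt(np.diag(gamma))
    return -gamma / np.outer(d, d)

N = 20                          # number of unknowns (illustrative)
limit = -1.0 / (N - 1)          # off-diagonal value at which Gamma is singular
X = 0.999 * limit               # just short of the singular limit

# Gamma has 1's on the diagonal and X everywhere else.
gamma = (1.0 - X) * np.eye(N) + X * np.ones((N, N))
A = np.linalg.inv(gamma)

full = full_correlations(A)[0, 1]
cond = conditional_correlations(A)[0, 1]

print("conditional correlation:", cond)
print("full correlation:       ", full)
```

With these values the conditional correlation between any pair is small (its magnitude is about 1/(N - 1)), while the full correlation is close to 1.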

There are three objections to using the correlations, full or conditional, as primary indicators of 
accuracy. First, although the correlations give information about the shape of the confidence ellipsoid, they 
completely ignore its size. Figure (11.2-2) shows two confidence ellipsoids. Ellipse A is completely 
contained within ellipse B and is, therefore, clearly preferable; yet ellipse B has zero correlation and 
ellipse A has significant correlation. From this example, it is obvious that accurate estimates can have 
high correlations and poor estimates can have low correlations. To evaluate the accuracy of the estimates, 
you need information about the sensitivities as well as about the correlations; neither alone is adequate. 

As a more concrete example of the interplay between correlation and sensitivity, consider a scalar linear 
system: 

$$z(t_i) = D\,u(t_i) + H \tag{11.2-7}$$

We wish to estimate D. Both D and the bias H are unknown. The input $u(t_i)$ is an angular position of 
some control device. Suppose that the input time-history is as shown in Figure (11.2-3). A large portion of 
the energy in this input is from the steady-state value of 90°; the energy in the pulse is much smaller. This 
input is highly correlated with a constant bias input. Therefore, the estimate of D will be highly 
correlated with the estimate of H. (If this point is not obvious, we can choose a few time points on the figure 
and compute the corresponding covariance matrix.) The sensitivity with respect to D is high; because of the 
large values of u, small changes in D cause large changes in z. 

Now we consider the same system, with the input shown in Figure (11.2-4). Both the correlation and the 
sensitivity are much lower than they were for the input of Figure (11.2-3). These changes balance each other, 
resulting in the same accuracy in estimating D. The inputs shown in the two figures are identical, but 
measured with respect to reference axes rotated by 90°. The choice of reference axis is a matter of convention 
which should not affect the accuracy; it does, however, affect both the sensitivity and correlation. 

This example illustrates that the correlation alone is not a reasonable measure of accuracy. By redefining 
the reference axis of the input in this example, we can change the correlation at will to any value between 
-1 and 1. 
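
The effect of the reference axis can be checked with a short Python sketch of the system of Equation (11.2-7). Several things here are assumptions made only for illustration: the measurement noise is taken as white with unit variance, so that the Gauss-Newton second gradient is the sum of outer products of the regressors [u, 1]; and the input shape (a steady 90° value with a brief 5° pulse) is hypothetical, not the time history of the figures.

```python
import numpy as np

def accuracy_statistics(u):
    """Statistics for z = D*u + H, assuming unit-variance white noise."""
    regressors = np.column_stack([u, np.ones_like(u)])
    gamma = regressors.T @ regressors        # Gauss-Newton second gradient
    A = np.linalg.inv(gamma)                 # approximate error covariance
    sensitivity_D = gamma[0, 0]
    correlation_DH = A[0, 1] / np.sqrt(A[0, 0] * A[1, 1])
    cramer_rao_D = np.sqrt(A[0, 0])          # unconditional standard deviation
    return sensitivity_D, correlation_DH, cramer_rao_D

# Hypothetical input: steady 90 deg with a brief 5 deg pulse.
u_offset = np.full(100, 90.0)
u_offset[40:50] += 5.0

# The same motion measured from a reference axis rotated by 90 deg.
u_shifted = u_offset - 90.0

for label, u in (("offset input ", u_offset), ("shifted input", u_shifted)):
    s, c, b = accuracy_statistics(u)
    print(label, "sensitivity:", s, "correlation:", c, "std of D:", b)
```

Under these assumptions the sensitivity and the correlation both change substantially between the two inputs, while the unconditional standard deviation of the estimate of D does not.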

The second objection to the use of correlations as indicators of accuracy is more serious because it 
cannot be answered by simply looking at sensitivities and correlations together. In the same way that 
sensitivities are one-dimensional tools, correlations are two-dimensional tools. The utility of a tool restricted 
to two-dimensional subspaces is limited. Three simple examples of idealized but realistic situations serve to 
illustrate the dimensional limitations of the correlations. These examples involve free lateral-directional 
oscillation of an aircraft. 

For the first example, there is a yaw-rate feedback to the rudder and a rudder-to-aileron interconnect. 
Thus the aileron and rudder signals are both proportional to yaw rate. In this case, the conditional 
correlations of the aileron, rudder, and yaw-rate derivatives are 1 (or nearly so with imperfect data). Conditioned 
on the aileron derivatives being known exactly, changes in the rudder derivative estimates can be exactly 
compensated for by changes in the yaw-rate derivative estimates; thus, the conditional correlation is 1. The 
unconditional correlations, however, are easily seen to be only 1/2. Changes in the rudder derivative 
estimates must be compensated for by some combination of changes in the aileron and yaw-rate derivative estimates. 
Since there are no constraints on how much of the compensation must come from the aileron and how much from 
the yaw-rate derivative estimates, the unconditional correlations would be 1/2 (because, on the average, 
1/2 of the compensation would come from each source). 

For the second example, no feedback is present and there is a neutrally damped, dutch-roll oscillation 
(or a wing rock). The sideslip, roll-rate, and yaw-rate signals are thus all sinusoids of the same frequency, 
with different phases and amplitudes. Taken two at a time, these signals have low correlations. The 
conditional correlations consider only two parameters at a time, and thus the conditional correlations of the 
derivatives will be low. Nonetheless, the three signals are linearly dependent when all are considered 
together, because they can all be written as linear combinations of a sine wave and a cosine wave at the 
dutch-roll frequency. The unconditional correlations of the derivatives will be 1 (or nearly so with imperfect 
data). 

Both of the above examples have three-dimensional correlation problems, which prevent the parameters from 
being identifiable. The conditional correlations are low in one case, and the unconditional correlations are 
low in the other. Although neither alone is sufficient, examination of both the conditional and unconditional 
correlations will always reveal three-dimensional correlation problems. 

For the third example, suppose that a wing leveler feeds back bank angle to the aileron, and that a 
neutrally damped dutch roll is present with the feedback on. There are then four pertinent signals (sideslip, 
roll rate, yaw rate, and aileron) that are sinusoids with the same frequency and different phases. In this 
case, both the conditional and the unconditional correlations will be low. Nonetheless, there is a correlation 
problem which results in unidentifiable parameters. This correlation problem is four-dimensional and cannot 
be seen using the two-dimensional correlations. 

The full and conditional correlations are closely related to the eigenvalues of 2-by-2 submatrices of the 
A and $\Gamma$ matrices, respectively, normalized to have unity diagonal elements. Specifically, the eigenvalues are 
1 plus the correlation and 1 minus the correlation; thus, high correlations correspond to large eigenvalue 
spreads. Higher-order correlations would be investigated using eigenvalues of larger submatrices. Looked at 
in this light, the investigation of 2-by-2 submatrices is revealed as an arbitrary choice dictated by its 
familiarity more than by any objective criterion. The eigenvalues of the full normalized A and $\Gamma$ matrices 
would seem more appropriate tools. These eigenvalues and the corresponding eigenvectors can provide some 
information, but they are seldom used. In principle, small eigenvalues of the normalized $\Gamma$ matrix or large 
eigenvalues of the normalized A matrix indicate correlations among the parameters with significant components 
in the corresponding eigenvectors. Note that the eigenvalues of the unnormalized $\Gamma$ and A matrices are of 
little use in studying correlations, because scaling effects tend to dominate. 
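
The eigenvalue relationships described above can be sketched in a few lines of Python. The matrices are hypothetical: a 2-by-2 covariance with correlation 0.8, and a 3-by-3 inverse-covariance matrix with unity diagonal and off-diagonal elements of -0.49, chosen to lie close to the singular value -1/2. The helper name `normalize` is ours.

```python
import numpy as np

def normalize(M):
    """Scale a symmetric positive definite matrix to unity diagonal elements."""
    d = np.sqrt(np.diag(M))
    return M / np.outer(d, d)

# Two-parameter case: variances 4 and 9, correlation 0.8.
rho = 0.8
pair = normalize(np.array([[4.0, rho * 2.0 * 3.0],
                           [rho * 2.0 * 3.0, 9.0]]))
pair_eigenvalues = np.linalg.eigvalsh(pair)    # ascending: 1 - rho, 1 + rho

# Three-parameter case: moderate pairwise values, nearly singular as a whole.
X = -0.49
gamma3 = (1.0 - X) * np.eye(3) + X * np.ones((3, 3))
eigenvalues, eigenvectors = np.linalg.eigh(normalize(gamma3))
smallest = eigenvalues[0]           # eigh returns eigenvalues in ascending order
direction = eigenvectors[:, 0]      # eigenvector for the smallest eigenvalue

print("2-by-2 eigenvalues:", pair_eigenvalues)
print("smallest 3-by-3 eigenvalue:", smallest)
print("corresponding eigenvector:", direction)
```

In the three-parameter case the small eigenvalue flags a near-dependence, and the roughly equal components of its eigenvector indicate that all three parameters take part in it, which no single pairwise value shows.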

The last objection to the use of the correlations is the difficulty of presentation. It is impractical to 
display the estimated correlations graphically in a problem with more than a handful of unknowns. The most 
common presentation is simply to print the matrix of estimated correlations. This option offers little 
improvement in comprehensibility over simply printing the A matrix. If there are a large number of 
experiments, it is pointless to print all of the correlation matrices. Such a nongraphical presentation cannot 
reasonably give a coherent picture of the system analyzed. 

11.2.3 Cramer-Rao Bound 

The Cramer-Rao bound is the last of the statistics based on the confidence ellipsoid. It proves to be the 
most useful of these statistics. The Cramer-Rao bound is often referred to by other names, including the 
standard deviation and the uncertainty level. We will consider both statistical and nonstatistical 
interpretations of the Cramer-Rao bound. 

The Cramer-Rao bound of an estimated scalar parameter is the standard deviation of the error in that 
parameter. Strictly speaking, the term Cramer-Rao bound applies only to the approximation to the standard 
deviation obtained from the Cramer-Rao inequality. For the purposes of this section, the properties are 
similar, regardless of the source of the standard deviation. In terms of the A matrix, the Cramer-Rao bound of 
the ith element of $\xi$ is $(A_{ii})^{1/2}$. 

The Cramer-Rao bound is closely related to the insensitivity. Both are standard deviations of the error, 
the only difference being that the insensitivity is the conditional standard deviation, whereas the Cramer-Rao 
bound is unconditional. They are also computationally similar, the difference being in whether the inversion 
is of the matrix or of the individual element. 

The geometric relationship between the Cramer-Rao bound and the insensitivity is particularly revealing. 
The Cramer-Rao bound on $\xi_i$ is the largest change that you can make in $\xi_i$ and still remain within the 
confidence ellipsoid. During this search, the other components are free to take any values that keep the point 
within the confidence ellipsoid. This definition is identical to the geometric definition of the 
insensitivity, except that the other components are constrained to the estimated values in the definition of 
insensitivity. This constraint is directly related to the statistical conditioning in the definition of the 
insensitivity; the Cramer-Rao bound has no such constraints and is an unconditional standard deviation. 

The Cramer-Rao bound must always be at least as large as the insensitivity, because releasing a constraint 
can never make the solution of a maximization problem smaller. This fact relates to our previous statement 
that correlation effects can increase, but never decrease, the error band defined by the insensitivity. 

Figure (11.2-5) illustrates the geometric interpretation of the Cramer-Rao bounds and insensitivities in a 
two-dimensional example. 
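
A two-dimensional sketch in Python makes the geometric comparison concrete. The covariance matrix A below is hypothetical. The fragment traces the boundary of the c = 1 ellipsoid (taken here as the set of points x with x*A⁻¹x = 1, centered at the origin for convenience) and compares the largest excursion along each axis with the Cramer-Rao bounds and with the insensitivities.

```python
import numpy as np

# Hypothetical error-covariance matrix (illustrative values only).
A = np.array([[4.0, 2.4],
              [2.4, 2.0]])

# Cramer-Rao bounds: square roots of the diagonal elements of A.
cramer_rao = np.sqrt(np.diag(A))

# Insensitivities: invert the matrix first, then the individual elements.
insensitivity = np.diag(np.linalg.inv(A)) ** -0.5

# Trace the ellipsoid boundary: if A = L L^T and |v| = 1, then x = L v
# satisfies x^T inv(A) x = 1.
theta = np.linspace(0.0, 2.0 * np.pi, 3601)
unit_circle = np.vstack([np.cos(theta), np.sin(theta)])
boundary = np.linalg.cholesky(A) @ unit_circle

# Largest change in each component while staying inside the ellipsoid.
largest_excursion = boundary.max(axis=1)

print("Cramer-Rao bounds:   ", cramer_rao)
print("largest excursions:  ", largest_excursion)
print("insensitivities:     ", insensitivity)
```

The largest excursions match the Cramer-Rao bounds to within the resolution of the boundary sampling, and each insensitivity is no larger than the corresponding bound.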

To prove that the Cramer-Rao bound is the solution to the above optimization problem, we will state and 
prove a more general result. (The general result is actually easier to prove.) 

Theorem 11.2-1 Given a fixed vector y and a positive definite symmetric 
matrix H, the maximum of x*y, subject to the constraint that x*Hx ≤ 1, is 
given by $\sqrt{y^*H^{-1}y}$. 

Proof Since x*y has no unconstrained local extrema, the solution must 
lie on the constraint boundary; therefore, the inequality in the constraint 
can be replaced by an equality. This constrained optimization problem can 
be restated by the use of Lagrange multipliers (Luenberger, 1969) as the 
problem of finding an unconstrained stationary point of 

$$f(x,\lambda) = x^*y - \tfrac{1}{2}\lambda\,(x^*Hx - 1) \tag{11.2-8}$$

where $\lambda$ is the scalar Lagrange multiplier. The maximum is found by setting 
the gradients to zero as follows: 

$$0 = \nabla_x f(x,\lambda) = y - \lambda Hx \tag{11.2-9}$$

$$0 = \frac{\partial}{\partial\lambda} f(x,\lambda) = -\tfrac{1}{2}(x^*Hx - 1) \tag{11.2-10}$$

From Equation (11.2-9) we have 

$$x = \lambda^{-1}H^{-1}y \tag{11.2-11}$$

Substituting this into Equation (11.2-10) gives 

$$y^*H^{-1}\lambda^{-1}H\lambda^{-1}H^{-1}y - 1 = 0 \tag{11.2-12}$$

or 

$$\lambda^{-2}\,y^*H^{-1}y = 1 \tag{11.2-13}$$

or, taking the positive root, which corresponds to the maximum, 

$$\lambda = \sqrt{y^*H^{-1}y} \tag{11.2-14}$$

Substituting into Equation (11.2-11) gives 

$$x = \frac{H^{-1}y}{\sqrt{y^*H^{-1}y}} \tag{11.2-15}$$

and thus 

$$x^*y = \frac{y^*H^{-1}y}{\sqrt{y^*H^{-1}y}} = \sqrt{y^*H^{-1}y}$$

at the solution. This is the result sought. 

The specific case of y being a unit vector along the $\xi_i$ axis, with H taken as $A^{-1}$, gives the form 
claimed for the Cramer-Rao bound of the $\xi_i$ element. 


The general form of Theorem (11.2-1) has other applications. The value of any linear combination of the 
parameters can be expressed as $\xi^*y$ for some fixed y vector. Thus the general form shows how to evaluate 
the accuracy of arbitrary linear combinations of parameters. This form applies to many situations where the 
sum, difference, or other combination of multiple parameters is of interest. 
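
As an illustrative sketch, the Python fragment below applies the theorem to the difference of two parameters. The covariance matrix A and the choice of combination are hypothetical; the constraint matrix of the theorem is taken as the inverse of A, so that the maximum of the theorem becomes the square root of y*Ay.

```python
import numpy as np

# Hypothetical error-covariance matrix (illustrative values only).
A = np.array([[4.0, 2.4],
              [2.4, 2.0]])
H = np.linalg.inv(A)            # constraint matrix of the theorem

# Weights of the linear combination: first parameter minus second.
y = np.array([1.0, -1.0])

# Theorem 11.2-1: the maximum of x*y subject to x*Hx <= 1 is
# sqrt(y* inv(H) y), which here equals sqrt(y* A y).
combination_bound = np.sqrt(y @ A @ y)

# Maximizing point from Equation (11.2-15): x = inv(H) y / sqrt(y* inv(H) y).
x_star = A @ y / combination_bound

print("bound on the difference:", combination_bound)
print("constraint value at maximizer:", x_star @ H @ x_star)
print("objective value at maximizer: ", x_star @ y)
```

Note that the bound on the difference (about 1.1 for these numbers) is smaller than either individual Cramer-Rao bound, because the positive correlation between the two errors partly cancels in the difference.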


On the basis of this geometric picture, we can think of the Cramer-Rao bounds as insensitivities that are 
computed accounting for all parameter correlations. The computation and interpretation of the Cramer-Rao 
bounds are valid in any number of dimensions. In this respect, the Cramer-Rao bounds contrast with the 
insensitivities, which are one-dimensional tools, and the correlations, which are two-dimensional tools. The 
Cramer-Rao bounds are thus the best of the theoretical measures of accuracy that can be evaluated for a single 
experiment. 


11.3 OTHER MEASURES OF ACCURACY 

The previous sections have discussed the Cramer-Rao bound and other accuracy statistics based on the 
confidence ellipsoid. Although the Cramer-Rao bound is the best single analytical measure of accuracy, 
overreliance on any single source of accuracy data is dangerous. Uncritical use of the Cramer-Rao bound can give 
extremely misleading results in realistic situations, as discussed by Maine and Iliff (1981b). This section 
discusses alternate accuracy measures, which can supplement the Cramer-Rao bound. 

11.3.1 Bias 

The bias of an estimator is occasionally cited as an indicator of accuracy. We do not consider it a 
useful indicator in most circumstances. This section is limited to a brief exposition of the reasons for this 
judgment. 

Section 4.2.1 defines the bias of an estimator. Bias arises from several sources. Some estimators are 
intrinsically biased, regardless of the nature of the data. Random noise in the data often causes a bias. 
The bias from random noise sometimes goes to zero asymptotically for estimators matched to the noise 
characteristics. Finally, the inevitable modeling errors in analyzing real systems cause all estimators to be biased, 
even asymptotically. Most discussions of bias refer, implicitly or explicitly, to asymptotic bias. Even for 
idealized cases with no modeling error, estimators are seldom unbiased for finite time. 

There are two reasons why the bias is of minimal use as a measure of accuracy. First, the bias reflects 
only the consistent errors; it ignores random scatter. As illustrated in Section 4.2.1, it is possible for an 
estimator to give ludicrous individual estimates which average out to a small or zero bias. This property is 
intrinsic to the definition of the bias. 

Second, the bias is difficult to compute in most cases. If we could compute the bias, we could subtract 
it from the estimates to obtain revised estimates that were unbiased. (Some estimators use this technique.) 

In some cases, it may be practical to compute a bound on the magnitude of the bias from a particular 
source, even when we cannot compute the actual bias. Although they are rarely used, such bounds can give a 
reasonable indication of the likely magnitude of the error from some sources. This is the most constructive 
use of bias information in evaluating accuracy. 

In contrast, the often-repeated statements that a given estimator is or is not asymptotically unbiased 
are of little practical use. Most of the estimators considered in this document are asymptotically unbiased 
when the assumptions used in the derivation are true. The statement that other estimators are biased under the 
same conditions amounts to a restatement of the universal principle that estimators are biased in the presence 
of modeling error. Thus arguments about which of two estimators is biased are silly. These arguments reduce 
to the issue of what assumptions to use, an issue best addressed directly. 

Although quantitative measures of bias may not be available, the analyst should always consider the issue 
of bias due to modeling error. Bias errors are added to all other types of error in the estimates. 
Unfortunately, some bias errors are impossible to detect solely by analyzing the data. The estimates can be 
repeatable with little scatter and appear to be accurate by all other measures, and still have large bias 
errors. An example of this type of problem is a calibration error in a nonredundant instrument. The only way 
to avoid such problems is to be meticulous in executing and documenting every step of the application, 
including modeling, instrumentation, and data handling. No automatic tests exist that adequately substitute for 
such care. 

11.3.2 Scatter 


When there are several experiments at the same condition, the scatter of the estimates is an indication 
of accuracy. We can also evaluate scatter about a smooth fairing of the estimates in a series of experiments 
with gradually changing conditions. This approach assumes that the parameters change smoothly as a function 
of experimental condition. 

The scatter has a significant advantage over many of the theoretical measures of accuracy discussed above. 
The scatter measures the actual performance that some of the theoretical measures are trying to predict. 
Therefore the scatter includes several effects, such as random errors in measuring the experiment conditions, 
that are ignored in the theoretical predictions. You can gain the most information, of course, by considering 
both the observed scatter and the theoretical predictions. 

An inherent weakness in the use of scatter as a gauge of accuracy is that several data points are required 
to define it. Depending on the application, this objection can range from inconsequential to insurmountable. 

A related problem is that the scatter does not show the accuracy of individual points, some of which may be 
better than others. For instance, if only two conflicting data points are available, the scatter gives no hint 
as to which is more reliable. Figure (11.3-1) shows estimates of the parameter $C_{n_p}$ obtained from flight data 
of a PA-30 aircraft. The scatter is large, showing estimates of both signs. 

Figure (11.3-2) shows the same data segregated into rudder and aileron maneuvers. In this case, the 
scatter makes it evident that the aileron maneuvers result in far more consistent estimates of $C_{n_p}$ than do 
the rudder maneuvers. Had there been only one or two aileron and one or two rudder maneuvers available, there 
would have been no way to deduce from the scatter that the aileron maneuvers were superior for estimating this 
parameter. 

The scatter shares a weakness with most of the theoretical accuracy measures in that it does not account 
for consistent errors (i.e., biases). Many occurrences can result in small scatter about an incorrect value. 
The scatter, therefore, should be regarded as a lower bound. The estimates can be worse than is indicated by 
the scatter, but are seldom better. 

Maine and Iliff (1981b) discuss well-documented situations in which the scatter is significantly larger 
than the Cramer-Rao bounds. In all such cases, we regard the scatter as a more realistic measure of the 
magnitude of the errors. The Cramer-Rao bound is still a reasonable means of determining which individual 
experiments are most accurate, but may not give a reasonable magnitude of the error. 
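
A comparison of this kind takes only a few lines of Python. The numbers below are invented purely for illustration (they are not the PA-30 data of the figures): six estimates of one parameter from repeated experiments at the same condition, together with the Cramer-Rao bound reported for each.

```python
import numpy as np

# Hypothetical estimates of one parameter from repeated experiments.
estimates = np.array([-0.052, -0.047, -0.061, -0.055, -0.044, -0.058])

# Hypothetical Cramer-Rao bounds reported with each estimate.
cramer_rao = np.array([0.002, 0.002, 0.003, 0.002, 0.002, 0.003])

# Observed scatter: sample standard deviation about the mean estimate.
scatter = np.std(estimates, ddof=1)

# Ratio of the observed scatter to the typical predicted standard deviation.
ratio = scatter / np.mean(cramer_rao)

print("observed scatter:", scatter)
print("scatter / mean Cramer-Rao bound:", ratio)
```

A ratio well above 1, as in this made-up case, is the situation described above in which the scatter is the more realistic indication of the error magnitude.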

In spite of its problems, the data scatter is an easily used tool for evaluating accuracy, and it should 
always be examined when sufficient data points are available to define it. 

11.3.3 Engineering Judgment 

Engineering judgment is the oldest measure of estimate reliability. Even with the theoretical accuracy 
measures now available, the need for judgment remains; the theoretical measures are merely tools which supply 
more information on which to base the judgment. By definition, the process of applying engineering judgment 
cannot be described precisely and quantitatively, or there would be no judgment involved. Algorithms can be 
devised to search for specific problems, but the engineer still needs to make a final unautomated judgment. 
Therefore, this section will simply list some of the factors most often considered in making a judgment. 

One of the most basic factors in judging the accuracy of the estimates is the anticipated accuracy. The 
engineer usually has a priori knowledge of how accurately one can reasonably expect to be able to estimate 
the parameters. This knowledge can be based on previous experience, awareness of the relative importance and 
linear dependence of the parameters, and the quality of experimental data obtained. 

Another basic criterion is the reasonability of the estimated parameter values. Before analysis is begun, 
we usually know the approximate range of values of the parameters. Drastic deviations from this range are 
reason to suspect the estimates unless we discover the reason for the poor prediction or we independently 
verify the suspect value. 

We have previously mentioned the role of engineering judgment in evaluating model adequacy. The engineer 
must look for violations of specific assumptions made in deriving the model, and for unexplained problems that 
may indicate modeling errors. Both the estimator and the theoretical measures of accuracy can be invalidated 
by modeling errors. The magnitude of the modeling-error effects must be judged. 

The engineer judges the quality of the fit of the measured and estimated time histories. The characteris- 
tics of this fit can give indications of many problems. Many modeling error problems first become apparent as 
poor time-history fits. Failed sensors and data processing errors or omissions are among the other classes of 
problems which can be deduced from the fits. 

Finally, engineering judgment is used to assemble and weigh all of the available information about the 
estimates. You must combine the judgmental factors with information from the theoretical tools in order to 
give a final best estimate of the parameters and of their accuracies. 


11.4 MODEL STRUCTURE DETERMINATION 

In the previous sections, we have largely assumed that the model form is correct. This is never 
strictly true in practice. Therefore, we must always consider the possible effects of modeling error as a 
special issue. The tools discussed in Section 11.3 can help in the evaluation of these effects. 

In this section, we specifically examine the question of determining the best model structure for 
parameter estimation. One approach to minimizing the effects of model structure errors is to use a model structure 
which is close to that of the true system. There are, however, definite limits to this principle. The 
limitations arise both in how accurate you can make the model and in how accurate you should make it. 

In the field of simulation, it is almost axiomatic that the simulation fidelity improves as more detail 
is added t: th^ model. Practical considerations of cost and the degree of required fidelity dictate the level 
of detail included in the model. Simulation and system identification are closely related fields, and we 
might expect that such a basic principle would be common to both. Contrary to this expectation, system identi- 
fication sometimes obtains better results from a simple than from a detailed model. The use of too detailed a 
model is probably one of the most common sources of difficulty in the practical application of system 
identification. 

The problems that arise from too detailed a model are best illustrated by a simple example. Presume that 
Figure (11.4-1) shows experimental data from a system with a scalar input U, and a scalar output Z. The line 
in the figure is the best linear fit to the data. This line appears to be a reasonable representation of the 
system. 

To investigate possible nonlinear effects, consider the case of polynomial models. It is obvious that the 
error between the model output and the experimental data will become smaller as the order of the model 
increases. High-order polynomials include lower-order polynomials as special cases (we have no requirement 
that the high-order coefficient be nonzero), so the best second-order fit is at least as good as the best 
linear fit, and so forth. When the order of the polynomial becomes one less than the number of data points, 
the model will exactly match the experimental data (unless input values were repeated). 

Figure (11.4-2) shows such a perfect match of the data from Figure (11.4-1). Although the data points are 
matched perfectly, the curve oscillates wildly. The simple linear fit of Figure (11.4-1) is probably a much 
better representation of the system, even though the model of Figure (11.4-2) is more detailed. We could say 
that the model of Figure (11.4-2) is fitting the noise instead of the true response. 
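
The effect described above can be reproduced with a small numerical sketch. The data here are hypothetical and are not those of Figure (11.4-1): we assume a truly linear system Z = 1 + 2U, eight measurements, and a fixed alternating error of 0.1 in place of random noise so that the example is deterministic. The names used are illustrative only.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Hypothetical data: a truly linear system Z = 1 + 2U, measured at 8 inputs
# with a fixed alternating "noise" of +/-0.1 so the example is deterministic.
u = np.linspace(-1.0, 1.0, 8)
truth = 1.0 + 2.0 * u
z = truth + 0.1 * (-1.0) ** np.arange(u.size)

u_dense = np.linspace(-1.0, 1.0, 141)   # grid that contains the 8 measurement points
truth_dense = 1.0 + 2.0 * u_dense


def fit_polynomial(order):
    """Least-squares polynomial fit; returns (RMS fit error, max error vs. truth)."""
    coeffs = P.polyfit(u, z, order)
    rms_fit = float(np.sqrt(np.mean((P.polyval(u, coeffs) - z) ** 2)))
    max_true = float(np.max(np.abs(P.polyval(u_dense, coeffs) - truth_dense)))
    return rms_fit, max_true


# Orders 1 through 7; order 7 is one less than the number of data points.
results = {order: fit_polynomial(order) for order in range(1, u.size)}
for order, (rms_fit, max_true) in results.items():
    print(f"order {order}: RMS fit error {rms_fit:.2e}, max error vs. truth {max_true:.2e}")
```

The fit error against the measurements can only decrease as the order rises, and it vanishes at order 7. The error against the true response behaves differently: in this sketch the linear fit stays much closer to the true response than the exact-match polynomial does.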

Essentially, as the model complexity increases, and more unknown parameters are estimated, the problem 
approaches the black-box system-identification problem where there are no assumptions about the model form. We 
have previously shown that the pure black-box problem is insoluble. One can deduce only a finite amount of 
information about the system from a finite amount of experimental data. The engineer provides, in the form of 
an assumed model structure, the rest of the information required to solve the system-identification problem. 
As the assumed model structure becomes more general, it provides less information, and thus more of the 
information must be deduced from the experimental data. Eventually, one reaches a point where the information 
available is insufficient; the estimation algorithms then perform poorly, giving ridiculous results. 

The Cramer-Rao bound gives a statistical basis for estimating whether the experimental data contain 
sufficient information to reliably estimate the parameters in a model. This and related statistics can be used to 
determine the number and selection of terms to include in the model (Klein and Batterson, 1983; Gupta, Hall, 
and Trankle, 1978; and Trankle, Vincent, and Franklin, 1982). The basic principle is to include in the model 
only those terms that can be accurately estimated from the available experimental data. This process, known as 
model structure determination, is described in further detail in the cited references. We will restrict our 
discussion to the general nature and applicability of model structure determination. 
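
As a minimal sketch of how the Cramer-Rao bound reflects the information available per parameter, consider a polynomial model that is linear in its coefficients, with white Gaussian measurement noise of known standard deviation. Under those assumptions (ours, for illustration) the information matrix is X*X divided by the noise variance, where X holds the regressors. This is not the algorithm of the cited references; the function name and the test input are hypothetical.

```python
import numpy as np


def cramer_rao_std(regressors, noise_std):
    """Cramer-Rao bound on coefficient standard deviations.

    Assumes z = regressors @ theta + white Gaussian noise of known
    standard deviation, so the information matrix is X^T X / sigma^2.
    """
    information = regressors.T @ regressors / noise_std**2
    return np.sqrt(np.diag(np.linalg.inv(information)))


# Hypothetical experiment: 21 evenly spaced inputs, noise std of 0.1.
u = np.linspace(-1.0, 1.0, 21)
noise_std = 0.1

bounds = {}
for order in (1, 3, 5):
    # Columns are 1, U, U^2, ..., U^order.
    regressors = np.vander(u, order + 1, increasing=True)
    bounds[order] = cramer_rao_std(regressors, noise_std)
    print(f"order {order}: bound on slope coefficient = {bounds[order][1]:.4f}")
```

The bound on the slope coefficient never decreases as higher-order terms are added to the model: the same data must support more unknowns. A term whose bound is large relative to its estimated value is a candidate for omission, although the threshold for that decision remains a matter of judgment.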

Automatic model structure determination is often viewed as a panacea that eliminates the necessity for 
model selection to be based on engineering judgment and knowledge of the phenomenology of the system. Since 
we have repeatedly emphasized that pure black-box system identification is impossible, such claims for 
automatic model determination must be viewed with suspicion. 

There is a basic fallacy in the argument that automatic model structure determination can replace 
engineering judgment in selecting a model. The model structure determination algorithms are not creative; they can 
only test candidate models suggested by the engineer. In fact, the model structure determination algorithms 
are a type of parameter estimation in disguise, in which the parameter is an index indicating which model is to 
be used. In a way, model structure determination is easier than most parameter estimation. At each stage, 
there are only two possible values for a term, zero or nonzero; whereas most parameter estimation demands that 
a specific value be picked from the entire real line. This task does not approach the scope of the black-box 
system-identification problem in which the number of possible models is a high order of infinity. 

Engineering judgment is still needed, therefore, to select the types of candidate models to be tested. If 
the candidate models are not appropriate, the results will be questionable. The very best that could be 
expected from an automatic algorithm in this circumstance would be rejection of all of the candidates (and not 
all automatic tests have even that much capability). No automatic algorithm can suggest creative improvements 
that it has not been specifically programmed for. 

Consider a system with an actual output of Z = sin(U). Assume that a polynomial model has been selected 
by the engineer, and automatic structure determination has been used to determine what order polynomial to use. 
The task is hopeless in this form. The data can be fit arbitrarily well with a polynomial of a high enough 
order, but the polynomial form does not describe the essence of the system. In particular, the finite poly- 
nomial will not be valid for extrapolating system performance outside of the range of the experimental data. 

In the above system, consider three ranges of U-values: |U| < 0.1, |U| < 1.0, and |U| < 10.0. In the 
range |U| < 0.1, the linear polynomial Z = U is a close approximation, as shown in Figure (11.4-3). The 
extrapolation of this approximation to the range |U| < 1.0 introduces noticeable errors, as shown in 
Figure (11.4-4). Over this range, the approximation Z = U - U^3/6 is reasonable. If we expand our view to 
the range |U| < 10.0, as in Figure (11.4-5), then neither the linear nor the third-order polynomial is at all 
representative of the sine function. It would require at least a seventh-order polynomial to match even the 
gross characteristics of the sine function over this range; a good match would require a still higher order. 
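
The three ranges can be compared numerically. The following sketch evaluates the largest error of the linear and third-order approximations to sin(U) over each range; the helper names and the sampling density are our own choices for illustration.

```python
import numpy as np


def max_error(approximation, half_range, samples=2001):
    """Largest |sin(U) - approximation(U)| over -half_range <= U <= half_range."""
    u = np.linspace(-half_range, half_range, samples)
    return float(np.max(np.abs(np.sin(u) - approximation(u))))


def linear(u):
    return u


def cubic(u):
    return u - u**3 / 6.0


errors = {
    half_range: (max_error(linear, half_range), max_error(cubic, half_range))
    for half_range in (0.1, 1.0, 10.0)
}
for half_range, (lin_err, cub_err) in errors.items():
    print(f"|U| <= {half_range:4.1f}: linear {lin_err:.3e}, cubic {cub_err:.3e}")
```

Over the smallest range the linear error is below 0.0002. Over the middle range it grows to about 0.16, while the third-order form stays within 0.01. Over the largest range both errors exceed the unit amplitude of the sine function many times over.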

Another problem with automatic model structure determination is that it gives only a statistical estimate. 
Like all estimates, it is imperfect. If no better information is available, it is appropriate to use 
automatic model structure determination as the best guess. If, however, facts about the model structure are 
deducible from the physics of the system, it is silly to throw away known facts and use imperfect estimates. 
(This is one of the most basic principles in the entire field of system identification, not just in model 
structure determination: if a fact is known, use it and save the estimation theory for cases in which it is 
needed.) 

The most basic problem with automatic model structure determination lies in the statement of the problem. 
The very term "model structure determination" is misleading, because there is seldom a correct model to 
determine. Even when there is a correct model, it may be far too complicated for practical purposes. The real 
model structure determination problem is not to determine some nonexistent "correct" model structure, but to 
determine an adequate model structure. We discussed the idea of adequate models in Section 1.4; the idea of 
an adequate model structure is an intimate part of the idea of an adequate model. 

This basic issue is addressed briefly, if at all, in most of the literature on model structure determina- 
tion. Many papers generate simulated data with a specified model, and then demonstrate that a proposed model 
structure determination algorithm can determine the correct model. This approach has little to do with the 
real issue in model structure determination. 

The previous paragraphs have emphasized the numerous problems of automatic model structure determination. 
That these problems exist does not mean that automatic model structure determination is worthless, only that 
the mindless application of it is dangerous. Automatic model structure determination can be a valuable tool 
when used with an appreciation of its limitations. Most good model structure determination programs allow the 
engineer to override the statistical decision and force specific terms to be included or omitted. This 
approach makes good use of both the theory and the judgment, so that the theory is used as a tool to aid the 
judgment and to warn against some types of poor judgment, but the end responsibility lies with the engineer. 


11.5 EXPERIMENT DESIGN 

The previous discussion has, for the most part, assumed that a specific set of experimental data has 
already been gathered. In some cases, this is a valid assumption. In other cases, the opportunity is 
available to specify the experiments to be performed and the measurements to be taken. This section gives a brief 
overview of the subject of designing experiments for parameter identification. We leave detailed discussion 
to works cited in the references. 

Methods for experiment design fall into two major categories. The first category is that of methods based 
on numerical optimization. Such methods choose an input, subject to appropriate constraints, which minimizes 
the Cramer-Rao bound or some related error estimate. Goodwin (1982) and Plaetschke and Schulz (1979) give 
theoretical and practical details of some optimization approaches to input design. 
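
The idea of ranking inputs by the Cramer-Rao bound can be sketched for a deliberately simple case. We assume a static model Z = a + bU with white Gaussian noise of known standard deviation, an amplitude constraint |U| <= 1, and a hand-picked set of three candidate inputs. All of these are hypothetical choices for illustration. The sketch enumerates candidates instead of running a numerical optimizer, and it is not the method of the cited references.

```python
import numpy as np


def bound_trace(u, noise_std):
    """Trace of the Cramer-Rao bound for the hypothetical model Z = a + b*U."""
    regressors = np.column_stack([np.ones_like(u), u])
    information = regressors.T @ regressors / noise_std**2
    return float(np.trace(np.linalg.inv(information)))


n = 10
noise_std = 0.1
# Candidate inputs, all respecting the amplitude constraint |U| <= 1.
candidates = {
    "ramp": np.linspace(-1.0, 1.0, n),
    "half_amplitude_square": 0.5 * (-1.0) ** np.arange(n),
    "full_amplitude_square": (-1.0) ** np.arange(n),
}

scores = {name: bound_trace(u, noise_std) for name, u in candidates.items()}
best = min(scores, key=scores.get)
print(scores, "->", best)
```

In this sketch the input that spends all of its time at the constraint boundary gives the smallest bound. The result depends entirely on the assumed constraint, which is why the form of the constraints deserves as much care as the optimization itself.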

Experiment design is often strongly constrained by practical considerations; in the extreme case, the 
constraints completely specify the input, leaving no latitude for design. In a design based on numerical 
optimization, the constraints must be expressed mathematically. The derivation of such expressions is sometimes 
straightforward, as when a control device is limited by a physical stop at a specific position. In other 
cases, the constraints involve issues such as safety that are difficult to quantify as precise limits. 

Slight changes in the form of the constraints can change the entire character of the theoretical optimum 
input. Because the constraints are one of the major influences in the experiment design, adopting simplified 
constraint forms solely because they are easy to analyze is often inadvisable. In particular, "soft" 
constraints in the form of a cost penalty proportional to the square of the input are almost never accurate 
representations of practical constraints. 

Most practical experiment design falls into the second major category, methods based more on heuristic 
design than on formal optimization of a cost function. Such designs draw heavily on the engineer's 
understanding of the system. There are several widely applicable rules of thumb to help heuristic experiment design; 
some of them consider issues such as frequency content, modal excitation, and independence. Plaetschke and 
Schulz (1979) describe some of these rules, and evaluate inputs based on them. 




Figure (11.1-3). Construction of two-dimensional 
confidence ellipsoid. 



Figure (11.1-2). Construction of one-dimensional 
confidence ellipsoid. 

Figure (11.4-2). Exact polynomial match of 

Figure (11.4-5). Z = sin(U) in the range |U| < 10.0. 


CHAPTER 12 


12.0 SUMMARY 

In this document, we have presented the theoretical background of statistical estimators for dynamic 
systems, with particular emphasis on maximum-likelihood estimators. An understanding of this theoretical 
background is crucial to the practical application of the estimators; the analyst needs to know the capabilities 
and limitations of the estimators. There are several examples of artificially complicated problems that 
succumb to simple approaches, and seemingly trivial questions that have no answers. 

A thorough understanding of the system being analyzed is necessary to complement this theoretical back- 
ground. No amount of theoretical sophistication can compensate for the lack of such understanding. The entire 
theory rests on the basis of the assumptions made about the system characteristics. The theory can give only 
limited help in validating or refuting such assumptions. 

Errors and unexpected difficulties are inevitable in any substantial parameter estimation project. The 
eventual success of the project hinges on the analyst's ability to recognize unreasonable results and diagnose 
their causes. This ability, in turn, requires an understanding of both estimation theory and the system being 
analyzed. Problems can range from obvious instrumentation failures to subtle modeling inconsistencies and 
identifiability problems. 

Probably the most difficult part of parameter estimation is to straddle the fine line between models too 
simple to adequately represent the system and models too complicated to be identifiable. There is no 
conservative position on this issue; excesses in either direction can be fatal. The solution is typically iterative, 
using diagnostic skills to detect problems and make improvements until an adequate result is obtained. The 
problem is exacerbated by there being no correct answer. 

Neither is there a single correct method to solve parameter estimation problems. Although we have 
castigated some practices as demonstrably poor, we make no attempt to establish as dogma any particular method. 
The material of this document is intended more as a set of tools for parameter estimation problems. The 
selection of the best tools for a particular task is influenced by factors other than the purely theoretical. 
Better results often come from a crude, but adequate, method that the analyst thoroughly understands than from 
a sophisticated, but unfamiliar, method. We recommend the attitude expressed by Gauss (1809, p. 108): 

     It is always profitable to approach the more difficult problems in several 
     ways, and not to despise the good although preferring the better. 

APPENDIX A 


A.0 MATRIX RESULTS 

This appendix presents several matrix results used in the body of the book. The derivations are mostly 
exercises in simple matrix algebra. Various of these results are given in numerous other documents; Goodwin 
and Payne (1977, appendix E) present most of them. 

A.1 MATRIX INVERSION LEMMAS 

Consider a square, nonsingular matrix A, partitioned as 

$$ A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \tag{A.1-1} $$

where $A_{11}$ and $A_{22}$ are square. Define the inverse of A to be $\Gamma$, similarly partitioned as 

$$ A^{-1} = \Gamma = \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{21} & \Gamma_{22} \end{bmatrix} \tag{A.1-2} $$

where $\Gamma_{11}$ is the same size as $A_{11}$. We want to express the partitions $\Gamma_{ij}$ in terms of the $A_{ij}$. To 
derive such expressions, we need to assume that either $A_{11}$ or $A_{22}$ is invertible; if both are singular, there 
is no useful form. Consider first the case where $A_{11}$ is invertible. 

Lemma A.1-1  Given A and $\Gamma$ partitioned as in Equations (A.1-1) and (A.1-2), 
assume that A and $A_{11}$ are invertible. Then $(A_{22} - A_{21} A_{11}^{-1} A_{12})$ is invertible 
and the partitions of $\Gamma$ are given by 

$$ \Gamma_{11} = A_{11}^{-1} + A_{11}^{-1} A_{12} (A_{22} - A_{21} A_{11}^{-1} A_{12})^{-1} A_{21} A_{11}^{-1} \tag{A.1-3} $$

$$ \Gamma_{12} = -A_{11}^{-1} A_{12} (A_{22} - A_{21} A_{11}^{-1} A_{12})^{-1} \tag{A.1-4} $$

$$ \Gamma_{21} = -(A_{22} - A_{21} A_{11}^{-1} A_{12})^{-1} A_{21} A_{11}^{-1} \tag{A.1-5} $$

$$ \Gamma_{22} = (A_{22} - A_{21} A_{11}^{-1} A_{12})^{-1} \tag{A.1-6} $$

Proof  The condition $A\Gamma = I$ gives the four equations 

$$ A_{11} \Gamma_{12} + A_{12} \Gamma_{22} = 0 \tag{A.1-7} $$

$$ A_{21} \Gamma_{11} + A_{22} \Gamma_{21} = 0 \tag{A.1-8} $$

$$ A_{11} \Gamma_{11} + A_{12} \Gamma_{21} = I \tag{A.1-9} $$

$$ A_{21} \Gamma_{12} + A_{22} \Gamma_{22} = I \tag{A.1-10} $$

and the condition $\Gamma A = I$ gives the four equations 

$$ \Gamma_{11} A_{12} + \Gamma_{12} A_{22} = 0 \tag{A.1-11} $$

$$ \Gamma_{21} A_{11} + \Gamma_{22} A_{21} = 0 \tag{A.1-12} $$

$$ \Gamma_{11} A_{11} + \Gamma_{12} A_{21} = I \tag{A.1-13} $$

$$ \Gamma_{21} A_{12} + \Gamma_{22} A_{22} = I \tag{A.1-14} $$

Equations (A.1-7) and (A.1-12), respectively, give 

$$ \Gamma_{12} = -A_{11}^{-1} A_{12} \Gamma_{22} \tag{A.1-15} $$

$$ \Gamma_{21} = -\Gamma_{22} A_{21} A_{11}^{-1} \tag{A.1-16} $$

Substitute Equation (A.1-15) into Equation (A.1-10) and substitute 
Equation (A.1-16) into Equation (A.1-14) to get 

$$ (A_{22} - A_{21} A_{11}^{-1} A_{12}) \Gamma_{22} = I \tag{A.1-17} $$

$$ \Gamma_{22} (A_{22} - A_{21} A_{11}^{-1} A_{12}) = I \tag{A.1-18} $$

By the assumption of invertibility of A, the $\Gamma_{ij}$ exist and satisfy 
Equations (A.1-7) to (A.1-14). The assumption of invertibility of $A_{11}$ then 
assures, through the above substitutions, that $\Gamma_{22}$ satisfies 
Equations (A.1-17) and (A.1-18). Therefore $(A_{22} - A_{21} A_{11}^{-1} A_{12})$ is invertible and 
$\Gamma_{22}$ is given by Equation (A.1-6). 

Substituting Equation (A.1-6) into Equations (A.1-15) and (A.1-16) gives 
Equations (A.1-4) and (A.1-5). Finally, substituting Equation (A.1-5) into 
Equation (A.1-9) and solving for $\Gamma_{11}$ gives Equation (A.1-3), completing 
the proof. 

The case where $A_{22}$ is nonsingular is simply a permutation of the same lemma. 

Lemma A.1-2  Given A and $\Gamma$ partitioned as in Equations (A.1-1) and (A.1-2), 
assume that A and $A_{22}$ are invertible. Then $(A_{11} - A_{12} A_{22}^{-1} A_{21})$ is invertible 
and the partitions of $\Gamma$ are given by 

$$ \Gamma_{11} = (A_{11} - A_{12} A_{22}^{-1} A_{21})^{-1} \tag{A.1-19} $$

$$ \Gamma_{12} = -(A_{11} - A_{12} A_{22}^{-1} A_{21})^{-1} A_{12} A_{22}^{-1} \tag{A.1-20} $$

$$ \Gamma_{21} = -A_{22}^{-1} A_{21} (A_{11} - A_{12} A_{22}^{-1} A_{21})^{-1} \tag{A.1-21} $$

$$ \Gamma_{22} = A_{22}^{-1} + A_{22}^{-1} A_{21} (A_{11} - A_{12} A_{22}^{-1} A_{21})^{-1} A_{12} A_{22}^{-1} \tag{A.1-22} $$

Proof  Define a reordered matrix 

$$ A' = \begin{bmatrix} A_{22} & A_{21} \\ A_{12} & A_{11} \end{bmatrix} $$

The inverse of A' is given by the corresponding reordering of $\Gamma$. 

$$ \Gamma' = \begin{bmatrix} \Gamma_{22} & \Gamma_{21} \\ \Gamma_{12} & \Gamma_{11} \end{bmatrix} $$

Then apply the previous lemma to A' and $\Gamma'$. 

When both $A_{11}$ and $A_{22}$ are invertible, we can combine the above lemmas to obtain two other useful 
results. 

Lemma A.1-3  Assume that two matrices A and C are invertible. Further 
assume that one of the expressions $(A - BC^{-1}D)$ or $(C - DA^{-1}B)$ is invertible. 
Then the other expression is also invertible and 

$$ (A - BC^{-1}D)^{-1} = A^{-1} + A^{-1} B (C - DA^{-1}B)^{-1} D A^{-1} \tag{A.1-23} $$

Proof  Define $A_{11} = A$, $A_{12} = B$, $A_{21} = D$, and $A_{22} = C$. In order to apply 
Lemmas (A.1-1) and (A.1-2), we first need to show that the partitioned matrix defined by 
Equation (A.1-1) is invertible. 

If $(C - DA^{-1}B)$ is invertible, then the $\Gamma_{ij}$ defined by Equations (A.1-3) 
to (A.1-6) satisfy Equations (A.1-7) to (A.1-14). Therefore the partitioned matrix is invertible. 
Lemma (A.1-2) then gives the invertibility of $(A - BC^{-1}D)$, which is one of 
the desired results. 

Conversely, if we assume that $(A - BC^{-1}D)$ is invertible, then the $\Gamma_{ij}$ 
defined by Equations (A.1-19) to (A.1-22) satisfy Equations (A.1-7) to (A.1-14). 
Therefore the partitioned matrix is invertible and Lemma (A.1-1) gives the invertibility of the 
expression $(C - DA^{-1}B)$. 

Thus the invertibility of either expression implies invertibility of the other 
and of the partitioned matrix. We can now apply both Lemmas (A.1-1) and (A.1-2). Equating the 
expressions for $\Gamma_{11}$ given by Equations (A.1-3) and (A.1-19), and putting the 
result in terms of A, B, C, and D, gives Equation (A.1-23), completing the 
proof. 

Lemma A.1-4  Given A, B, C, and D as in Lemma (A.1-3), with the same 
invertibility assumptions, then 

$$ A^{-1} B (C - DA^{-1}B)^{-1} = (A - BC^{-1}D)^{-1} B C^{-1} \tag{A.1-24} $$

Proof  The proof is identical to that of Lemma (A.1-3), except that we equate 
the expressions for $\Gamma_{12}$ given by Equations (A.1-4) and (A.1-20), giving 
Equation (A.1-24) as a result. 
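
The identities of Lemmas (A.1-3) and (A.1-4) are easy to check numerically, which is a useful habit when transcribing them into a program. The following sketch uses hypothetical test matrices; the dimensions, the random seed, and the scaling are arbitrary choices made only to keep every required inverse well defined.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 2
# Hypothetical test matrices; the scaled identity terms keep A, C, and the
# two bracketed expressions comfortably invertible for this illustration.
A = 5.0 * np.eye(n) + 0.1 * rng.standard_normal((n, n))
C = 5.0 * np.eye(m) + 0.1 * rng.standard_normal((m, m))
B = 0.1 * rng.standard_normal((n, m))
D = 0.1 * rng.standard_normal((m, n))

inv = np.linalg.inv
A_inv, C_inv = inv(A), inv(C)

# Lemma A.1-3: (A - B C^-1 D)^-1 = A^-1 + A^-1 B (C - D A^-1 B)^-1 D A^-1
lhs_23 = inv(A - B @ C_inv @ D)
rhs_23 = A_inv + A_inv @ B @ inv(C - D @ A_inv @ B) @ D @ A_inv

# Lemma A.1-4: A^-1 B (C - D A^-1 B)^-1 = (A - B C^-1 D)^-1 B C^-1
lhs_24 = A_inv @ B @ inv(C - D @ A_inv @ B)
rhs_24 = inv(A - B @ C_inv @ D) @ B @ C_inv

# Direct inverse of the partitioned matrix, for comparison with the blocks.
gamma = inv(np.block([[A, B], [D, C]]))
print(np.max(np.abs(lhs_23 - rhs_23)), np.max(np.abs(lhs_24 - rhs_24)))
```

The upper-left block of the directly computed inverse should agree with both sides of the first identity, and the upper-right block with the negative of both sides of the second.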

A.2 MATRIX DIFFERENTIATION 

For several of the following results, it is convenient to define the derivative of a scalar with respect 
to a matrix. If f is a scalar function of the matrix A, we define df/dA to be a matrix with elements 
equal to the derivatives of f with respect to corresponding elements of A. 


$$ \left( \frac{df}{dA} \right)^{(i,j)} = \frac{df}{dA^{(i,j)}} \tag{A.2-1} $$

Two simple relations involving the trace function are useful in manipulating the matrix and vector 
quantities we work with. 

Result A.2-1  If x and y are two vectors of the same length, then 

$$ x^* y = \mathrm{tr}(y x^*) \tag{A.2-2} $$

Proof  Both sides expand to $\sum_i x^{(i)} y^{(i)}$. 

Result A.2-2  If A and B are two matrices of the same size, then 

$$ \sum_{i,j} A^{(i,j)} B^{(i,j)} = \mathrm{tr}(A B^*) \tag{A.2-3} $$

Proof  Expand the right side, element by element. 

Both of these results are special cases of the same relationship between inner products and outer products. 
The following result is a particular application of Result (A.2-2). 

Result A.2-3  If f(A) is a scalar function of the matrix A, and A is a 
function of the scalar x, then 

$$ \frac{df}{dx} = \mathrm{tr}\left[ \frac{df}{dA} \left( \frac{dA}{dx} \right)^* \right] \tag{A.2-4} $$

Proof  Use the chain rule with the individual elements of A to write 

$$ \frac{df}{dx} = \sum_{i,j} \frac{\partial f}{\partial A^{(i,j)}} \frac{dA^{(i,j)}}{dx} \tag{A.2-5} $$

Equation (A.2-4) then follows from Result (A.2-2) and the definition given 
by Equation (A.2-1). 

Result A.2-4  If the matrix A is a function of x, then 

$$ \frac{d}{dx}(A^{-1}) = -A^{-1} \frac{dA}{dx} A^{-1} \tag{A.2-6} $$

wherever A is invertible. 

Proof  By the definition of the inverse 

$$ A A^{-1} = I \tag{A.2-7} $$

Take the derivative, using the product rule. 

$$ \frac{d}{dx}(A A^{-1}) = \frac{dI}{dx} \tag{A.2-8} $$

$$ \frac{dA}{dx} A^{-1} + A \frac{d}{dx}(A^{-1}) = 0 \tag{A.2-9} $$

Solving for $d/dx(A^{-1})$ gives Equation (A.2-6), as desired. 

Result A.2-5  If A is invertible, and x and y are vectors, then 

$$ \frac{d}{dA}(x^* A^{-1} y) = -(A^{-1} y x^* A^{-1})^* \tag{A.2-10} $$

Proof  Use Result (A.2-4) to get 

$$ \frac{\partial}{\partial A^{(i,j)}}(x^* A^{-1} y) = -x^* A^{-1} \frac{\partial A}{\partial A^{(i,j)}} A^{-1} y \tag{A.2-11} $$

Now 

$$ \frac{\partial A}{\partial A^{(i,j)}} = e_i e_j^* \tag{A.2-12} $$

where $e_i$ is a vector with zeros in all but the ith element, which is 1. 
Therefore 

$$ \frac{\partial}{\partial A^{(i,j)}}(x^* A^{-1} y) = -x^* A^{-1} e_i e_j^* A^{-1} y = -e_j^* A^{-1} y x^* A^{-1} e_i \tag{A.2-13} $$

which is the (i,j) element of $-(A^{-1} y x^* A^{-1})^*$. The definition of the matrix 
derivative then gives Equation (A.2-10) as desired. 

Result A.2-6  If A is invertible, then 

$$ \frac{d}{dA} \ln|A| = (A^*)^{-1} \tag{A.2-14} $$

Proof  Expanding the determinant by cofactors of the ith row gives 

$$ \ln|A| = \ln \sum_k A^{(i,k)} (\mathrm{adj}\,A)^{(k,i)} \tag{A.2-15} $$

Taking the derivative with respect to $A^{(i,j)}$ gives 

$$ \frac{\partial}{\partial A^{(i,j)}} \ln|A| = \frac{(\mathrm{adj}\,A)^{(j,i)}}{\sum_k A^{(i,k)} (\mathrm{adj}\,A)^{(k,i)}} \tag{A.2-16} $$

because $(\mathrm{adj}\,A)^{(k,i)}$ does not depend on $A^{(i,j)}$. Using Equation (A.2-15) and 
the expression for a matrix inverse in terms of the matrix of cofactors, we 
get 

$$ \frac{\partial}{\partial A^{(i,j)}} \ln|A| = \frac{(\mathrm{adj}\,A)^{(j,i)}}{|A|} = (A^{-1})^{(j,i)} \tag{A.2-17} $$

Equation (A.2-14) then follows, as desired, from the definition of the 
derivative with respect to a matrix. 
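
Results (A.2-5) and (A.2-6) can be checked against finite differences. The sketch below reads the asterisk as a transpose, which is an assumption appropriate to the real-valued test data used here. The helper function, the step size, and the test matrix are hypothetical choices for illustration; the matrix is diagonally dominant with a positive diagonal so that its determinant is positive.

```python
import numpy as np


def matrix_derivative(f, A, step=1e-6):
    """Central-difference estimate of df/dA, element by element."""
    grad = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            delta = np.zeros_like(A)
            delta[i, j] = step
            grad[i, j] = (f(A + delta) - f(A - delta)) / (2.0 * step)
    return grad


# Hypothetical, well-conditioned test data with |A| > 0.
A = np.array([[4.0, 0.3, -0.2],
              [0.1, 3.5, 0.4],
              [-0.3, 0.2, 5.0]])
x = np.array([1.0, -2.0, 0.5])
y = np.array([0.3, 1.0, -1.5])
A_inv = np.linalg.inv(A)

# Result A.2-5: d/dA (x* A^-1 y) = -(A^-1 y x* A^-1)*
numeric_5 = matrix_derivative(lambda M: x @ np.linalg.inv(M) @ y, A)
analytic_5 = -(A_inv @ np.outer(y, x) @ A_inv).T

# Result A.2-6: d/dA ln|A| = (A*)^-1
numeric_6 = matrix_derivative(lambda M: np.log(np.linalg.det(M)), A)
analytic_6 = A_inv.T

print(np.max(np.abs(numeric_5 - analytic_5)), np.max(np.abs(numeric_6 - analytic_6)))
```

A check of this kind is inexpensive insurance when these derivatives are coded by hand, since a misplaced transpose is otherwise easy to miss.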


REFERENCES 


Acton, Forman S.: Numerical Methods that Work. Harper & Row, New York, 1970. 

Akaike, Hirotugu: A New Look at Statistical Model Identification. IEEE Trans. Automat. Contr., Vol. AC-19, 
No. 6, pp. 716-723, 1974. 

Aoki, Masanao: Optimization of Stochastic Systems. Academic Press, New York, 1967. 

Apostol, Tom M.: Calculus: Volume II. Xerox College Publishing, Waltham, Mass., 2nd ed., 1969. 

Ash, Robert B.: Basic Probability Theory. John Wiley & Sons, Inc., New York, 1970. 

Astrom, Karl J.: Introduction to Stochastic Control Theory. Academic Press, New York, 1970. 

Astrom, Karl J. and Eykhoff, P.: System Identification - A Survey. Automatica, Vol. 7, pp. 123-162, 1970. 

Bach, R. E. and Wingrove, R. C.: Applications of State Estimation in Aircraft Flight Data Analysis. AIAA 
paper 83-2087, 1983. 

Balakrishnan, A. V.: Stochastic Differential Systems I. Filtering and Control - A Function Space Approach. 
Lecture Notes in Economics and Mathematical Systems, 84, M. Beckman, G. Goos, and H. P. Kunzi, eds., 
Springer-Verlag, Berlin, 1973. 

Balakrishnan, A. V.: Stochastic Filtering and Control. Optimization Software, Inc., Los Angeles, 1981. 

Balakrishnan, A. V.: Kalman Filtering Theory. Optimization Software, Inc., New York, 1984. 

Barnard, G. A.: Thomas Bayes Essay Toward Solving a Problem in the Doctrine of Chances. Biometrika, Vol. 45, 
1958. 

Bayes, Thomas: An Introduction to the Doctrine of Fluxions, and a Defence of the Mathematicians Against the 
Objections of the Author of the Analyst. John Noon, 1736. (See Barnard, 1958). 

Bierman, G. J.: Factorization Methods for Discrete Sequential Estimation. Mathematics in Science and 
Engineering, Vol. 128, Academic Press, New York, 1977. 

Brauer, Fred and Nohel, John A.: Qualitative Theory of Ordinary Differential Equations. W. A. Benjamin, 
New York, 1969. 

Cox, A. B. and Bryson, A. E.: Identification by a Combined Smoothing Nonlinear Programming Algorithm. 
Automatica, Vol. 16, pp. 689-694, 1980. 

Cramer, Harald: Mathematical Methods of Statistics. Princeton University Press, Princeton, N.J., 1946. 

Dixon, L. C. W.: Nonlinear Optimization. Crane, Russak & Co., New York, 1972. 

Doetsch, K. H.: The Time Vector Method for Stability Investigations. A.R.C. R. & M. 2945, 1953. 

Dongarra, J. J.; Moler, C. B.; Bunch, J. R.; and Stewart, G. W.: LINPACK User's Guide. SIAM, Philadelphia, 
1979. 

Etkin, B.: Dynamics of Atmospheric Flight. John Wiley & Sons, Inc., New York, 1958. 

Eykhoff, P.: System Identification, Parameter and State Estimation. John Wiley & Sons, London, 1974. 

Ferguson, Thomas S.: Mathematical Statistics: A Decision Theoretic Approach. Academic Press, New York, 1967. 

Fisher, R. A.: On the Mathematical Foundations of Theoretical Statistics. Phil. Trans. Roy. Soc. London, 
Vol. 222, pp. 309-368, 1921. 

Fiske, P. H. and Price, C. F.: A New Approach to Model Structure Identification. AIAA paper 77-1171, 1977. 

Flack, Nelson D.: AFFTC Stability and Control Technique. AFFTC-TN-59-21, Edwards, California, 1959. 

Foster, G. W.: The Identification of Aircraft Stability and Control Parameters in Turbulence. RAE TR 83025, 
1983. 

Garbow, B. S.; Boyle, J. M.; Dongarra, J. J.; and Moler, C. B.: Matrix Eigensystem Routines - EISPACK Guide 
Extension. Springer-Verlag, Berlin, 1977. 

Gauss, Karl Friedrich: Theory of the Motion of the Heavenly Bodies Moving About the Sun in Conic Sections. 
Translated by Charles Henry Davis, Dover Publications, Inc., New York, 1847. Translated from: Theoria 
Motus, 1809. 

Geyser, Lucille C. and Lehtinen, Bruce: Digital Program for Solving the Linear Stochastic Optimal Control and 
Estimation Problem. NASA TN D-7820, 1975. 

Goodwin, Graham C.: An Overview of the System Identification Problem Experiment Design. Sixth IFAC Symposium 
on Identification and System Parameter Estimation, Washington, D.C., 1982. 


Goodwin, Graham C. and Payne, Robert L.: Dynamic System Identification: Experiment Design and Data Analysis. 
Academic Press, New York, 1977. 

Greenberg, H.: A Survey of Methods for Determining Stability Parameters of an Airplane from Dynamic Flight 
Measurement. NASA TN-2340, 1951. 

Gupta, N. K.; Hall, W. E.; and Trankle, T. L.: Advanced Methods of Model Structure Determination from Test 
Data. AIAA J. Guidance and Control, Vol. 1, No. 3, 1978. 

Gupta, N. K.; and Mehra, R. K.: Computational Aspects of Maximum Likelihood Estimation and Reduction in 
Sensitivity Function Calculations. IEEE Trans. on Automat. Contr., Vol. AC-19, No. 6, pp. 774-783, 1974. 

Hajdasinski, A. K.; Eykhoff, P.; Damen, A. A. H.; and van den Boom, A. J. W.: The Choice and Use of Different 
Model Sets for System Identification. Sixth IFAC Symposium on Identification and System Parameter 
Estimation, Washington, D.C., 1982. 

Hodge, Ward F. and Bryant, Wayne H.: Monte Carlo Analysis of Inaccuracies in Estimated Aircraft Parameters 
Caused by Unmodeled Flight Instrumentation Errors. NASA TN D-7712, 1975. 

Jategaonkar, R. and Plaetschke, E.: Maximum Likelihood Parameter Estimation from Flight Test Data for General 
Nonlinear Systems. DFVLR-FB 83-14, 1983. 

Jazwinski, Andrew H.: Stochastic Processes and Filtering Theory. Academic Press, New York, 1970. 

Kailath, T. and Ljung, L.: Asymptotic Behavior of Constant-Coefficient Riccati Differential Equations. IEEE 
Trans. Automat. Contr., Vol. AC-21, pp. 385-388, 1976. 

Kalman, R. E. and Bucy, R. S.: New Results in Linear Filtering and Prediction Theory. Trans. ASME, Series D, 
Journal of Basic Engineering, Vol. 83, pp. 95-107, 1961. 

Klein, Vladislav: On the Adequate Model for Aircraft Parameter Estimation. CIT, Cranfield Report Aero No. 28, 
1975. 

Klein, Vladislav and Batterson, James G.: Determination of Airplane Model Structure from Flight Data Using 
Splines and Stepwise Regression. NASA TP-2126, 1983. 

Kushner, Harold: Introduction to Stochastic Control. Holt, Rinehart and Winston, Inc., New York, 1971. 

Levan, N.: Systems and Signals. Optimization Software, Inc., New York, 1983. 

Lipster, R. S. and Shiryayev, A. N.: Statistics of Random Processes I: General Theory. Springer-Verlag, 
New York, 1977. 

Luenberger, David G.: Optimization by Vector Space Methods. John Wiley & Sons, New York, 1969. 

Luenberger, David G.: Introduction to Linear and Nonlinear Programming. Addison-Wesley, Reading, Mass., 1972. 

Maine, Richard E.: Programmer's Manual for MMLE3, A General FORTRAN Program for Maximum Likelihood Parameter 
Estimation. NASA TP-1690, 1981. 

Maine, Richard E. and Iliff, Kenneth W.: User's Manual for MMLE3, A General FORTRAN Program for Maximum 
Likelihood Parameter Estimation. NASA TP-1563, 1980. 

Maine, Richard E. and Iliff, Kenneth W.: Formulation and Implementation of a Practical Algorithm for Parameter 
Estimation with Process and Measurement Noise. SIAM J. Appl. Math., Vol. 41, pp. 558-579, 1981(a). 

Maine, Richard E. and Iliff, Kenneth W.: The Theory and Practice of Estimating the Accuracy of Dynamic 
Flight-Determined Coefficients. NASA RP-1077, 1981(b). 

Meditch, J. S.: Stochastic Optimal Linear Estimation and Control. McGraw-Hill Book Co., New York, 1969. 

Mehra, Raman K. and Lainiotis, Dimitri G. (eds.): System Identification: Advances and Case Studies. Academic 
Press, New York, 1976. 

Moler, C. B. and Stewart, G. W.: An Algorithm for Generalized Matrix Eigenvalue Problems. SIAM J. of 
Numerical Analysis, Vol. 10, pp. 241-256, 1973. 

Moler, Cleve; and Van Loan, Charles: Nineteen Dubious Ways to Compute the Exponential of a Matrix. SIAM 
Review, Vol. 20, No. 4, pp. 801-836, 1978. 

Nering, Evar D.: Linear Algebra and Matrix Theory. John Wiley & Sons, Inc., New York, 2nd ed., 1970. 

Paige, Lowell J.; Swift, J. Dean; and Slobko, Thomas A.: Elements of Linear Algebra. Xerox College 
Publishing, Lexington, Mass., 2nd ed., 1974. 

Papoulis, Athanasios: Probability, Random Variables, and Stochastic Processes. McGraw-Hill Book Co., 
New York, 1965. 

Penrose, R.: A Generalized Inverse for Matrices. Proc. Cambridge Phil. Soc. 51, pp. 406-413, 1955. 

Pitman, E. J. G.: Some Basic Theory for Statistical Inference. Chapman and Hall, London, 1979. 

Plaetschke, E. and Schulz, G.: Practical Input Signal Design. AGARD Lecture Series No. 104, 1979. 

Polak, E.: Computational Methods in Optimization: A Unified Approach. Academic Press, New York, 1971. 

Potter, James E.: Matrix Quadratic Solutions. SIAM J. Appl. Math., Vol. 14, pp. 496-501, 1966. 

Rampy, John M. and Berry, Donald T.: Determination of Stability Derivatives from Flight Test Data by Means of 
High Speed Repetitive Operation Analog Matching. FTC-TDR-64-8, Edwards, Calif., 1964. 

Rao, S. S.: Optimization, Theory and Applications. Wiley Eastern Limited, New Delhi, 1979. 

Royden, H. L.: Real Analysis. The MacMillan Co., London, 1968. 

Rudin, Walter: Real and Complex Analysis. McGraw-Hill Book Co., New York, 1974. 

Schweppe, Fred C.: Uncertain Dynamic Systems. Prentice-Hall, Inc., Englewood Cliffs, N.J., 1973. 

Sorensen, John A.: Analysis of Instrumentation Error Effects on the Identification Accuracy of Aircraft 
Parameters. NASA CR-112121, 1972. 

Sorensen, Harold W.: Parameter Estimation: Principles and Problems. Marcel Dekker, Inc., New York, 1980. 

Strang, Gilbert: Linear Algebra and Its Applications. Academic Press, New York, 1980. 

Trankle, T. L.; Vincent, J. H.; and Franklin, S. N.: System Identification of Nonlinear Aerodynamic Models. 
AGARDograph, The Techniques and Technology of Nonlinear Filtering and Kalman Filtering, 1982. 

Vaughan, David R.: A Nonrecursive Algebraic Solution for the Discrete Riccati Equation. IEEE Trans. Automat. 
Contr., Vol. AC-15, pp. 597-599, 1970. 

Wiberg, Donald M.: State Space and Linear Systems. McGraw-Hill Book Co., New York, 1971. 

Wilkinson, J. H.: The Algebraic Eigenvalue Problem. Clarendon Press, Oxford, 1965. 

Wilkinson, J. H. and Reinsch, C.: Handbook for Automatic Computation. Volume II. Linear Algebra, Part 2. 
Springer-Verlag, New York, 1971. 

Wolowicz, Chester H.: Considerations in the Determination of Stability and Control Derivatives and Dynamic 
Characteristics from Flight Data. AGARD Rep. 549-Part 1, 1966. 

Wolowicz, Chester H. and Holleman, Euclid C.: Stability-Derivative Determination from Flight Data. AGARD 
Report 224, 1958. 

Zacks, Shelemyahu: The Theory of Statistical Inference. John Wiley & Sons, New York, 1971. 

Zadeh, Lotfi A. and Desoer, Charles A.: Linear System Theory. McGraw-Hill Book Co., New York, 1963. 



